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Abstract 

A general methodology is presented for the construction and effective use of control vari- 
ates for reversible MCMC samplers. The values of the coefficients of the optimal linear 
combination of the control variates are computed, and adaptive, consistent MCMC estima- 
tors are derived for these optimal coefficients. All methodological and asymptotic arguments 
are rigorously justified. Numerous MCMC simulation examples from Bayesian inference 
applications demonstrate that the resulting variance reduction can be quite dramatic. 

Keywords — Bayesian inference, log-linear models, mixtures of Normals, probit, threshold autoregres- 
sive models, variance reduction 



CM ■ 1 Introduction 

> , 

Markov chain Monte Carlo (MCMC) methods provide the facility to draw, in an asymptotic 
sense, a sequence of dependent samples from a very wide class of probability measures in any 
dimension. This facility - together with the tremendous increase of computer power in recent 
£ — ■ years - makes MCMC perhaps the main reason for the widespread use of Bayesian statistical 

modeling across the entire spectrum of quantitative scientific disciplines. 

This paper provides a firm methodological foundation for the construction and use of control 
variates for reversible MCMC samplers. Although popular in the standard Monte Carlo setting, 
control variates have received little attention in the MCMC literature. The proposed method- 
ology will be shown, in many instances, to reduce the variance of the resulting estimators quite 
dramatically. 

In the simplest Monte Carlo setting, when the goal is to compute the expected value of some 
function F evaluated on independent and identically distributed (i.i.d.) samples Xx,X^, ■ ■ ., the 
variance of the standard ergodic averages of the F(Xi) can be reduced by exploiting available 
zero-mean statistics. If there are one or more functions Ux, U2, ■ ■ ■ , f/fe - the control variates - 
for which it is known that the expected value of U(Xi) is equal to zero, then adding any linear 
combination, 0iU\{Xi) + 62U2{Xi) + - ■ ■+9i e U).(Xi), to the F{Xi) does not change the asymptotic 
mean of the corresponding ergodic averages. Moreover, if the best constant coefficients {0*} are 
used, then the variance of the estimates is no larger than before and often it is much smaller. The 
standard practice in this setting is to estimate the optimal {0j} adaptively, based on the same 
sequence of samples; see, e.g., Liu (2001), Givens and Hoeting (2005), Robert and Casella (2004) 
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for details. Because of the demonstrated effectiveness of this technique, in many important areas 
of application - e.g., in computational finance where Monte Carlo methods are a basic tool for 
the approximate computation of expectations, see Glasserman (2004) - a major research effort 
is devoted to the construction of effective control variates in specific applied problems. 

However, up to now little has been established in the way of extending the above methodology 
to estimators based on MCMC samples, at least in part due to the intrinsic difficulties presented 
by the Markovian structure. For example, Mengersen et al. (1999) comment that "control 
variates have been advertised early in the MCMC literature (see, e.g., Green and Han (1992)), 
but they are difficult to work with because the models are always different and their complexity 
is such that it is extremely challenging to derive a function with known expectation." Indeed, 
there are two fundamental difficulties; not only is it hard to find nontrivial functions with 
known expectation with respect to the stationary distribution of the chain, but also, even in 
cases where such functions are available, there is no effective way to obtain useful estimates of the 
corresponding optimal coefficients {0* }. The reason why this is a fundamentally difficult problem 
is that the MCMC variance of ergodic averages is intrinsically an infinite-dimensional object: 
It cannot be written in closed form as a function of the transition kernel and the stationary 
distribution of the chain. 

An early reference of variance reduction for Markov chain samplers is Green and Han (1992), 
who exploit an idea of Barone and Frigessi (1989) and construct antithetic variables that may 
achieve variance reduction in simple settings but they do not appear to be widely applicable. 
Andradottir et al. (1993) focus on finite state space chains, they observe that optimum variance 
reduction can be achieved via the solution of the associated Poisson equation (see Section 2.1), 
and they propose numerical algorithms for its solution. Rao-Blackwellisation has been suggested 
by Gelfand and Smith (1990) and by Robert and Casella (2004) as a way to reduce the variance 
of MCMC estimators. Also, Philippe and Robert (2001) investigated the use of Riemann sums 
as a variance reduction tool in MCMC algorithms. An interesting as well as natural control 
variate that has been used, mainly as a convergence diagnostic, by Fan et al. (2006), is the score 
statistic. Although Philippe and Robert (2001) mention that it can be used as a control variate, 
its practical utility has not been investigated. Atchade and Perron (2005) restrict attention 
to independent Metropolis samplers and provide an explicit formula for the construction of 
control variates based on adaptive estimators. Mira et al. (2003) note that a solution to the 
Poisson equation provides the optimum control variate and they attempt to solve it numerically. 
Hammer and Hakon (2008) construct control variates for general Metropolis-Hastings samplers 
by expanding the state space. To estimate the optimal coefficients {6*} they use the same 
formula that one obtains for control variates in i.i.d. Monte Carlo sampling, but such estimators 
are strictly suboptimal; they are briefly discussed in Section 2.3, where we also explore their 
efficiency. 

A more relevant, for our purposes, line of work is that initiated by Henderson (1997), who 
observed that, for any real- valued function G defined on the state space of a Markov chain {X n }, 
the function U(x) := G(x) — E[G{X n+ i)\X n = x] has zero mean with respect to the stationary 
distribution of the chain. Henderson (1997), like some of the other authors mentioned above, 
also notes that the best choice for the function G would be the solution of the associated Poisson 
equation, and proceeds to compute approximations of this solution for specific Markov chains, 
with particular emphasis on models arising in stochastic network theory. 

The gist of our approach is to adapt Henderson's idea for the construction of control variates, 
and use them in conjunction with a new, efficiently implementable and provably optimal esti- 
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mator for the coefficients {Oj} for reversible chains. The ability to estimate the {Oj} effectively 
makes these control variates practically relevant in the statistical MCMC context, and allows us 
to avoid having to compute analytical approximations to the solution of the underlying Poisson 
equation. Our estimator for {6*} is adaptive, in the sense that is based on the MCMC output, 
and it can be used after the sample is obtained, making its actual computation independent of 
the MCMC algorithm. 

This methodology not only generalizes the classical method of control variates to the MCMC 
setting, it also offers an important advantage: Unlike the case of independent sampling where 
control variates need to be found in an ad hoc manner depending on the specific problem at 
hand, here the control variates (as well as estimates for the corresponding optimal coefficients) 
come for free. The only requirement for the application of this method at the post-processing 
stage is the availability of a function G of the sampled parameters, together with its one-step 
conditional expectation, E[G(X n+ i)\X n = x\. As we show in numerous specific examples, these 
are often readily available; for example, the availability of such expectations is essentially a 
prerequisite for Gibbs sampling. 

For any one particular application, there is, of course, a plethora of functions G (and, conse- 
quently, of corresponding control variates U) that can be used, so an important consideration for 
the effectiveness of this methodology for variance reduction is the careful choice of these func- 
tions. This issue is addressed in detail; we provide numerous illustrative examples of estimation 
problems based on MCMC samplers, motivated primarily by Bayesian inference problems. These 
examples are chosen as representing different major classes of MCMC samplers commonly used 
in important applications. In each case, the ideas underlying the choice of the functions G are 
explained, and these choices are justified either rigorously or heuristically, in connection with 
the theoretical development we present. 

The examples we consider range from the simplest, illustrative samplers, to complex appli- 
cations of Bayesian models to real data. In all cases, the resulting variance reduction is very 
significant and often quite large: For all the MCMC-based ergodic estimators we consider, the 
use of control variates gives variances at least 30 times smaller, and often hundreds or thousands 
of times smaller. 

Presently we focus only on cases of reversible MCMC samplers for which the one-step condi- 
tional expectations, E[G{X n+ i)\X n = x], of one or more functions G are available analytically 
in closed form. MCMC algorithms with this property include a vast array of samplers commonly 
used in practical Bayesian inference problems. In the examples presented in Sections 3 and 6 be- 
low we outline the implementation details of our methodology for a representative subset of both 
simple and complex models. Since our estimators for {6*} are applicable to reversible chains, we 
employ random-scan instead of the usual systematic-scan Gibbs or Metropolis-within-Gibbs algo- 
rithms. We also investigate the behavior of our estimators on discrete state space, random-walk 
Metropolis-Hastings samplers, and on Metropolis-within-Gibbs samplers. Although, strictly 
speaking, our theoretical development does not necessarily require that conditional expectations 
E\G{X n+ i)\X n = x] be analytically available, almost all of the examples presented here do have 
that property, primarily for the sake of convenience and of clarity of exposition. Further ongoing 
work by Dellaportas et al. (2008) explores ways in which this same theory can be applied to 
arbitrary reversible MCMC samplers, including cases where one-step conditional expectations 
are unavailable. 

As mentioned above, Henderson (1997) takes a different path toward optimizing the use of 
control variates for Markov chain samplers. Considering primarily continuous-time Markov pro- 
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cesses, an approximation G for the solution to the associated Poisson equation is derived from 
the so-called "heavy traffic" or "fluid model" approximations of the original process. The moti- 
vation and application of this method is primarily related to examples from stochastic networks 
and queueing theory. Closely related approaches are presented by Henderson and Glynn (2002) 
and Henderson et al. (2003), where the effectiveness of network control policies of multiclass net- 
works is evaluated via Markovian simulation tools. There, control variates are used for variance 
reduction, and the optimal parameters {9*} are estimated via an adaptive, stochastic gradient al- 
gorithm. General convergence properties of ergodic estimators using control variates are derived 
by Henderson and Simon (2004), in the case when the solution to the Poisson equation (either 
for the original chain or for an approximating chain) is known explicitly. Kim and Henderson 
(2007) introduce two related adaptive methods for tuning non-linear versions of the parameters 
{9j}, when using families of control variates that naturally admit a non-linear parameterization. 
After deriving asymptotic properties for these estimators, they present numerical examples for 
a simulation problem related to pricing derivative instruments in computational finance. In the 
case when the control variate U is defined in terms of a function G that can be taken as a 
Lyapunov function for the chain {X n }, Meyn (2006) derives precise exponential asymptotics for 
the performance of estimators employing such control variates. 

The rest of the paper is organized as follows. Section 2.1 gives the basic definitions that 
will remain in effect throughout the paper, and motivates the construction of control variates in 
connection with the Poisson equation. Sections 2.2, 2.3 and 2.4, building on ideas of Henderson 
(1997), illustrate the use of naive estimators of the optimal coefficient for a single control variate, 
and develop the theory for two new estimators for reversible chains. In Section 3 we investigate 
the impact of these estimators on variance reduction in five small MCMC examples, which are 
representative of a larger class of Bayesian inference problems. Section 4 discusses the effect 
of our estimators on bias reduction, compares the two estimators and advocates the use of 
one of them for general purposes. These estimators are generalized in Section 5 to the case of 
multiple control variates. Four more complex Bayesian inference problems that are implemented 
via MCMC are visited in Section 6; guidelines for constructing appropriate control variates are 
given, and their effects on variance reduction are illustrated. Finally, we provide theoretical 
justifications of our asymptotic arguments in Section 7 and conclude with a short discussion of 
possible further extensions in Section 8. 

2 Control Variates for Markov Chains 
2.1 The setting 

Suppose {A ra } is a discrete-time Markov chain with initial state Xq = x, taking values in the 
state space X with an associated c-algebra B. In typical applications, X will often be a (Borcl 
measurable) subset of M. d together the collection B of all its (Borel) measurable subsets. [More 
precise definitions and detailed assumptions will be given in Section 7.] The distribution of 
is described by its transition kernel, P(x, dy), 

P(x,A) := Pr{X k+1 G A\X k = x}, xeX,AeB. (1) 

It is well known that in many applications where it is desirable to compute the expectation 
E 7r (F) := tt(F) := f F dir of some function F : X — > R with respect to some probability measure 
7r on (X,£>), it turns out that, although the direct computation of n(F) is impossible or we 
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cannot even produce samples from tt, we can construct an easy-to-simulate Markov chain 
which has tt as its unique invariant measure. Under appropriate conditions, the distribution of 
{X n } converges to tt, a fact which can be made precise in several ways. For example, writing 
PF for the function, 

PF(x) := E X [F{X 1 )\ := E[F(X 1 ) \X =x], x £ X, 

we have that, for any initial condition x, 

P n F{x) := E[F(X n ) \X = x]^ tt(F), as n -»• oo, 

for an appropriate class of functions F : X — > R. Furthermore, the rate of this convergence can 
be quantified by the function, 

oo 

F{x) = £ [ pn n*) - <F)] , (2) 

n=0 

where F is easily seen to satisfy the Poisson equation for F, namely, 

PF — F = —F + tt(F). (3) 

To see that, at least formally, simply apply P to both sides of (2) and note that the resulting 
series for PF — F becomes telescoping and simplifies to — F + n(F). 

The above results describe how the distribution of X n converges to tt. In terms of estimation, 
the quantities of interest are the ergodic averages, 



i=0 

Again, under appropriate conditions the ergodic theorem holds, 

H n (F) — >■ ir(F), a.s., as n — > oo, (4) 

for an appropriate class of functions F. Moreover, the rate of this convergence is quantified by 
an associated central limit theorem, which states that, 

n-l 



M^n(F) - tt(F)) = -L J}F(X0 - tt(F)] AJV(0,4), asn^oo, 
vn i=0 

where a 2 F , the asymptotic variance of F, is given by, 

j n—l oo 

a 2 F := lim Var w (vV„(F)) = Hm Var J— V F(Xij) = V Cov w (F(X ), F(X n )). 

n— >oo n— >-oo \\/Tl / -* ' 



i=0 

Alternatively, it can be expressed in terms of the solution F to Poisson's equation as, 

<% = «(F*-{PFf). (5) 
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The results in equations (2) and (5) clearly indicate that it is useful to be able to compute the 
solution F to the Poisson equation for F. In general this is a highly nontrivial - or impossible 
- task; for one thing, it requires knowledge of the mean tt(F). The following example is one of 
the rare cases where explicit computations are possible. 

Suppose {X ra } is a discrete time version of the Ornstein-Uhlenbeck process defined by, Xq = x 
and X n+ i = aX n + Z n , where a is a constant in (0, 1) and {Z n } are independent and identically 
distributed (i.i.d.) standard Normal random variables. Standard methods easily show that the 
distribution of X n converges to 7r := N(0, (1 — a 2 ) -1 ), so if we take F(x) = x, then, /U ra (F) — > 
ir(F) = a.s., as n — > oo. Moreover, the central limit theorem implies that, 



V^n(F) AiY(0,CT 2 ) 
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OO, 



where a 2 = a 2 F is given by (5). In order to compute the variance we need to know F. As a first 
guess, we take G{x) = cx + b and compute, 

PG{x) - G(x) = E[cX 1 + b\X = x]-cx-b = E[c{ax + Z x )\ - cx = -c(l - a)x. 

-x, we need c = (1 — a) -1 ; any b will do. Taking, for 



For this to be equal to —F{x) + n(F) 
simplicity, b = 0, yields, 



F(x) = 



x 



and PF(x) = 



ax 



1 — a 1 — a ' 

Therefore, writing W for N(0, (1 — a 2 )" 1 ) random variable, 

W 2 a 2 W 2 



x e 



^ = 7T ^f 2 -(pf) 2 )=e[-^- 



(l-a) 2 (1-q) 2 J (1-a 



2.2 Control variates 



Suppose that, for some Markov chain {X n } with transition kernel P and invariant measure ir, 
we use the ergodic averages fJ. n (F) as in (4) to estimate the mean tt(F) of some function F 
under n. In many applications, although the estimates H n {F) converge to vr(F) as n — > oo, the 
associated asymptotic variance o 2 F is large and the convergence is very slow. 

In order to reduce the variance, we employ the idea of using control variates, as in the case of 
simple Monte Carlo with i.i.d. samples; see, for example, the standard texts Robert and Casella 
(2004); Liu (2001); Givens and Hoeting (2005), or the paper by Glynn and Szechtman (2002) 
for extensive discussions. Given a function U : X — > R for which we know that ir(U) = 0, define, 

F e = F - 9U, (6) 

and consider the modified estimators, 

Hn(F 9 ) = fi n {F) - 6fi n (U). 

We will concentrate exclusively on the the following class of functions U proposed by Henderson 
(1997). For an arbitrary G : X ->• R with vr(|G|) < oo, define, 



U = G-PG. 
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The invariance of it under P and the integrability of G immediately imply that ir(U) = 0. [See 
Section 7 for the details, complete assumptions, and full, rigorous results corresponding to this 
discussion.] Therefore, the ergodic theorem guarantees that the {fi n (Fg)} are consistent with 
probability one, and it is natural to seek particular choices for U and 9 so that the asymptotic 
variance o 2 F of the modified estimators is significantly smaller that the variance op of the 
standard ergodic averages fj, n (F). 

Suppose, at first, that we have complete freedom in the choice of G, so that we may set 9 = 1 
without loss of generality. Then we wish to make the asymptotic variance of, 

F-U = F-G + PG, 

as small as possible. But, in view of the Poisson equation (3), we see that the choice G = F 
yields, 

F-U = F- F + PF = ir(F), (7) 

which has zero variance. Therefore, our first rule of thumb for choosing G is: 

Choose a control variate U = G — PG with G « F. 

As mentioned above, it is typically impossible to compute F for realistic models in applications. 
But it is often possible to come up with a guess G that approximates F, or at least some G for 
which heuristics indicate that it would be useful as a control variate. Once such a function is 
selected, we form the modified estimators fi n (Fg) with respect to the function Fg as in (6), 

Fg = F — 6U = F — 6G + 6PG. 

The next task is to choose 9 so that the resulting variance, 

^=4 e = ^!-(M) 2 ), 

is minimized. Note that, from the definitions, 

U = G and Fg = F — 6G. (8) 

Therefore, 

a 2 = tt((F - 9G) 2 } - ti{{PF - 9PG) 2 } . 
Expanding the above expression as a quadratic in 9, the optimal value for 6 is determined as, 

n(FG-(PF)(PG)) 

ir(G 2 -(PG) 2 ) ' K) 

Note that, since U = G, the denominator is simply afj. Once again, this expression depends on 
F, so it is not immediately clear how to estimate 9* directly from the data {X ra }. We consider 
the issue of estimating 9* in detail below, but first let us interpret the optimal value of 9*. 
Starting from the expression, 

1 n— 1 

a 2 = lim VarJ-p V[FpQ) - 9U(X t )}), 

rn-oo V ,/n ^ — ' / 

v i=0 
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simple calculations lead to, 

oo 

a 2 e =a 2 F + 9 2 a 2 u -29 £ Cov n {F(X ), U(X n )), 

n=—oo 

so that 9* can also be expressed as 

9 * = — E CoY 7T (F(X ),U(X n )), 

(J - - 



(10) 



u 



leading to the optimal asymptotic variance, 



^-2 

o> = Up 



^ n=— oc 



Co V7r (F(X ),C/(X n )) 



Therefore, in order to reduce the variance, we want to have the covariance between F and U to 
be as large as possible. This leads to our second rule of thumb for selecting control variates: 

Choose a control variate U = G — PG so that U and F are highly correlated. 

Incidently, note that, since the denominator of (9) equals afj, comparing the expressions for 9* 
in (9) and (11) we see that, 



Cov n (F(X ),U(X n )) = tt(FG - (PF)(PG)). 



(12) 



Moreover, the fact that is always nonnegative, suggests that there should be a way to rewrite 
the expression tt(G 2 — (PG) 2 ) in the denominator of 9* in a way which makes this nonnegativity 
obvious. Indeed: 

Lemma 1. The asymptotic variance afj of the function U = G — PG can be expressed as, 



a 2 ; = vr(G 2 - {PG) 2 ) = E 7T [^G(X 1 ) - PG{X j)' 
Proof. Starting from the right-hand side of (13), 



(13) 



Err 



(G(X 1 ) - PG(X )) 2 ] = tt{G 2 )-2E^[G{X 1 )PG(X () )} + t,({PG) 2 ) 



Xn 



} + n((PG) 2 ) 



= tt(G 2 ) -2E tc {e GiX^PGiXo 

= tt(G 2 ) - 2E W [e[G(X!) I X }PG(X )] + ir((PG) 2 ) 
= tt(G 2 - (PG) 2 ), 

as claimed. The fact that afj = tt(G 2 — (PG) 2 ) is immediate upon noting that U = G. 
In view of Lemma I, 9* can also be expressed as, 

ir(FG - (PF)(PG)) 



9* 



Ejv 



(G(X 1 )-PG(X )y 



n 



(14) 
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2.3 A suboptimal empirical estimate of 9* 

Let {^n} be a Markov chain with transition kernel P and invariant measure tt. In order to 
estimate the mean tt(F) of some function F under n, we replace the ergodic averages p n (F) of 
(4) by the modified estimates p n {Fe) = ^n(F) — 9p n (U), where the control variate U := G — PG 
for some fixed function G, which, we hope, approximates the solution F to the Poisson equation 
for F, or, at least, is strongly correlated with F. In order to select a "good" value for the 
coefficient 9 - a value that leads to a relatively small asymptotic variance for the estimates 
Hn{Fg) - we first consider the following simplistic scheme. 

Pretending momentarily that {X n } is a sequence of i.i.d. samples with distribution it, then 
F = F and the optimal coefficient choice for 9 becomes, 

_ Cov n (F,G) _ Cov n (F,U) 
iid Var^(G) Var n (U) ' 



which can be adaptively estimated by, 



, _ Vn(FU) 
^ ~ p n {U 2 ) " 



This leads us to the usual adaptive estimator for tt(F), commonly used in the case of i.i.d. 
samples, 

lln{FU) TT \ , m p n (U)p n (FU) 



^nMd(F) := »n(Fo n .J =Hn(F- {u2 l U ) = Vn{F) 



Pn(U 2 ) J ^ ^ Mt/ 2 ) ' 

To examine its performance when used on samples from a Markov chain, we consider an example. 

Example 1. A simple Gibbs sampler. Let n(x,y) be a bivariate Normal distribution with 
zero mean, unit variances, and covariance p > 0. We use the systematic-scan Gibbs sampler to 
simulate from tt. Starting from arbitrary Xq = x and Yq = y, X\ is generated by sampling from 
n(x\y) ~ N(py, (1 — p 2 )), and then Y\ is generated by sampling from n(y\Xi) ~ N(pX\, (1 — p 2 )). 
Continuing this way produces a Markov chain {(X n ,Y n )} with distribution converging to tt. 

Suppose we wish to estimate the expected value of X 2 under tt. Letting F(x,y) = x 2 , the 
standard estimates p n {F) — > ir(F) = E^{X 2 ) = 1 a.s., but when p is close to 1 the variance is 
high and the convergence very slow. In this particularly simple example, we can actually solve 
the Poisson equation for F. Since F is quadratic, we consider a candidate solution of the form 
G(x,y) = bx 2 + cy 2 . Direct calculation shows that, 

PG(x, y) - G(x, y) = -bx 2 + (p 4 c + p 2 b - c)y 2 + (p 2 c + b + c)(l-p 2 ), 

and for this to be identically equal to —F(x, y) + n(F) = —x 2 + 1, it suffices to take 6 = 1 and 
c = p 2 (l — p 4 ) -1 . Therefore, 

F(x,y) = x 2 + p 2 (l-p 4 )- 1 y 2 - 

From this we can compute the asymptotic variance <j 2 f of the estimates p n (F) by substituting 
F in (5), to obtain, 

4 = E n [(X 2 + cY 2 ) 2 - (1 + cY 2 ) 2 ] = 2(1 + p 4 )(l - p 4 )- 1 , 
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which is indeed high for p ~ 1. 

And now suppose that, as is typically the case in applications, 7r(x, y) is not available and we 
cannot obtain an explicit solution for F. In order to create a control variate U it is natural to 
start with G = F itself, since we certainly expect that U = F — PF will be strongly correlated 
with F. But F only depends on x, so, in order to take advantage of the fact that we also produce 
samples for y, we let G(x, y) = F(x, y) + F(y, x) = x 2 + y 2 and define the control variate, 

U(x, y) = G(x, y) - PG(x, y) = x 2 + (1 - p 2 - p 4 )y 2 - (2 - p 2 - p 4 ). 

We will now compare the performance of three estimators: (i) The standard estimator p n (F); 
(ii) The suboptimal adaptive estimator p n ,nd(F) based on the control variate U defined above; 
and (in) The optimal estimator p n (Fo*) based on the same control variate U, but with respect 
to the optimal value of 9*. 

Since in this case we know both n and F explicitly, for the sake of comparison we compute 
the theoretically optimum value of 9 appearing in (9) as, 

a*_ 1 + 3p 2 + 2p 4 

(l-/> 4 )(2 + V + 3/ + P 6 )" 

Figure 1 shows a typical realization of the performance of all three estimators for the following 
parameter values: The correlation p = 0.95, the number of steps n = 5000, the initial values are 
x o = Vo = 0> an d the optimal value of 9* 3.273. In this experiment, the (estimated) variance 
of the adaptive estimator is smaller than that of the standard estimator by a factor of ~ 3.13; 
whereas the variance of the optimal estimator is smaller than that of the standard estimator by 
a factor of ~ 18.52. 




Figure 1: The sequence of the standard ergodic averages is shown as a solid blue line; the suboptimal adaptive 
estimates [i n ,na(F) as red "+" signs; and the optimal estimates fj,„(Fg*) are green "*" signs. For visual clarity, 
the estimates fJ. n ,ad(F) and fi n (Fg») are plotted only every 100 simulation steps. 
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The reduction in the variance was computed from T = 100 independent repetitions of the 
same experiment. For fi n (F), we obtained T = 100 different estimates n$(F), for i = 1,2, ... ,T, 
and the variance of n n (F) was estimated by, 

Y^jy^iF)-nniF)?, (15) 

i=i 

where ji n {F) is the average of the n { n\F). The same procedure was applied to estimate the 
variance of fJ. n ,ad(F) and n n (Fg*). The factors by which the variance of fJ> n (F) is larger than 
that of Hn,n&{F) and [i n {FQ* ), respectively, are shown in Table 1. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 2000 


n = 5000 


n = 10000 




3.16 


2.96 


3.13 


3.01 


Hn{Fe*) 


18.28 


16.43 


18.52 


16.07 



Table 1: Estimated factors by which the variance of fJ> n (F) is larger than that of fi„,nd(F) and /i„(Je*), respectively, 
after n = 1000, 2000, 5000 and 10000 simulation steps. 

For different values of the number of iterations n, the corresponding variance reduction 
factors were computed based on independent runs, and are not continuations of shorter runs. 
Note that the adaptive estimates fJ> n ,Hd(F) were actually computed in two steps: First the 
value for the coefficient 6 n ^ was computed, and then the values [i n ,ud{F) were calculated. In 
both passes, the same simulation samples were used. We emphasize that this procedure is used 
throughout the paper. Indeed, the fact that the estimators can be computed after the MCMC 
sample has been obtained is a major advantage of our methodology. 

Clearly, although the adaptive estimator [i n ,\\d{F) does offer a significant advantage over 
Hn{F), there is a lot to be gained from obtaining more accurate estimates of the optimal coef- 
ficient 9*. We remark that, instead of treating the samples {^ n } as being i.i.d., more accurate 
estimates for 9* can be obtained by approximating the expression (10) via averages over blocks. 
Nevertheless, extensive simulation experiments clearly indicate that the corresponding estima- 
tion gains are usually negligible, while the optimal, consistent estimation procedures for 9* given 
in the following section make a very significant difference. 

2.4 Optimal empirical estimates of 9* for reversible chains 

Let A = P — I denote the generator of a discrete time Markov chain {X ra } with transition kernel 
P. If the chain is reversible, then A is a self-adjoint linear operator on the space L2(tt). This 
simply means that, 

tt(FAG) = 7r(AF G), 

for any two functions F,GG -^(tt)- Our central result in terms of the estimation methodology 
is the observation that, in this case, the optimal coefficient 9* admits a representation that does 
not involve the solution to Poisson's equation F: 
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Proposition 1. If the chain {X n } is reversible, then the optimal coefficient 6* for the control 
variate U = G — PG can be expressed as, 



or, alternatively, 



r = r . n((F-n(F))(G + PG)) 

rcv ' vr(G 2 -(PG) 2 ) ' { ' 



*((F-ir(F))(G + PG)) 
^rev •- ~r~, T2T- K L( ) 



(G(X 1 ) - PG(X )y 



Proof. Let F = F — ir(F) denote the centered version of F, and recall that F solves Poisson's 
equation for F, so PF = F — F. Therefore, the numerator in the expression for 9* in (9) can 
be expressed as, 

tt(FG - (PF)(PG)) = ir(FG - (F - F)(PG)) 

= ir(FPG - FAG) 
= tt(FPG-AFG) 
= tt(FPG + FG) 
= tt(F(G + PG)). 

This proves (16), and (17) follows from (14). □ 
The expressions (16) and (17) immediately suggest estimating 6* via, 

Hn(F(G + PG)) - fi n (F)n n (G + PG) 



7 n,rev,l 



Vn(G 2 ) - fl n ((PG) 



2~> 



& Vn{F{G + PG)) - n n {F)fx n {G + PG) 

or n,rcv ' 2 " ^(Gix^-PGix^r • 

The resulting estimators, Un(Fz ) and u n (Fn ) for tt(F) based on the control variate 

"n,rcv,l Pn,rev,2 

U = G — PG and the coefficients ^ n ,rev,i and f? n ,rev,2, respectively, are denoted, 

Hn,rev,l(F) := Hn{Fa ) = ^n{F — O nrcv iU) 

' ' p n,rcv,l it' 

and Hn,rev,2( F ) : = ^(Fg _ J = Hn(F - n>Tev , 2 U). 



An alternative way for estimating 6* adaptively, which also applies to non-reversible chains, 
was recently developed in Meyn (2007), based on the "temporal difference learning" algorithm. 
As most of the chains we will consider are reversible and this alternative method is computation- 
ally significantly more expensive than our estimates ^ n ,rev,i and 6 n ,rev,2, it will not be considered 
further in the present discussion. 

A slightly more general case of the earlier example with a bivariate Gaussian density is 
considered below; the random-scan Gibbs sampler is used to examine the performance of the 
two new estimators. We note that although the systematic-scan Gibbs sampler in general does 
not produce a reversible chain, the random-scan Gibbs sampler always does. Also, the back- 
and-forth version of the systematic-scan Gibbs sampler is reversible, see Roberts (1992). 
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Example 2. The bivariate Gaussian through the random-scan Gibbs sampler. Let 

(X, Y) ~ 7r(x, y) be an arbitrary bivariate Normal distribution, where, without loss of generality, 
we take the expected values of both X and Y to be zero and the variance of X to be equal to one. 
Let Var(Y) = r 2 and the covariance E(XY) = pr for some p G (—1, 1)- Given arbitrary initial 
values xo = x and yo = y, the Gibbs sampler selects one of the two co-ordinates at random, 
and either updates y by sampling from 7r(y\x) ~ N(prx, r 2 (l — p 2 )), or x from N(^y, 1 — p 2 ). 
Continuing this way produces a reversible Markov chain {(X n , Y n )} with distribution converging 

to 7T. 

To estimate the expected value of X under n, we let F(x, y) = x and G(x, y) = x + y, so 
that, 

PG(x,y) = ±(l + P T)x + l(l + ^)y, 

and the control variate U = G — PG is, U (x, y) = G(x, y) — PG(x, y) = ~(1 — pr)x + 1(1 — £)y. 
We will compare the performance of five estimators: (i) The standard estimator p n (F); (ii) The 
suboptimal adaptive estimator p-n^idiP) based on the control variate U = G — PG defined above; 
(iii,iv) The two adaptive estimators p n .rcv,i{P) and p n ,rev,2(P) for the same control variate U; 
(v) The optimal estimator p n (Fg*) based on the same control variate, but with respect to the 
optimal value of 9*. 

In Figure 2 we plot the results of all five estimators, applied to a typical execution of the 
Gibbs sampler with n = 5000 steps and initial values xq = yo = 0.1. The problem parameter 
values are p = 0.99 and r 2 = 10. 




500 1000 1500 2000 2500 3000 3500 4000 4500 



Figure 2: The sequence of the standard ergodic averages /-i n (F) is shown as a solid blue line; the suboptimal 
adaptive estimates ^i„,iid(-F) as red "+" signs; the optimal adaptive estimates ^„, rC v,i(-F ) as bold magenta dots, 
Pn,icv,2{F) as a dashed cyan line, and the estimates p n {Fg*) corresponding to the optimal value of 8* as green 
"*" signs. For visual clarity, all estimates except fi n {F) are plotted only every 100 simulation steps. 

While the optimal estimator p n (FQ*) offers an obviously large advantage over the standard 
estimates p n (F), the improvement of the suboptimal estimator ^ n ,iid(-^) is rather insignificant. 
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The adaptive estimators fJ- n ,rev,i(F) and (i n ,rcv,2(F) are similarly very effective, and their per- 
formance is fairly close to that of the optimal fi n (Fg*). As in Example 1, we compute the factor 
by which each of these methods reduces the variance of the standard estimates n n (F), using 
T = 200 independent repetitions of the same experiment; recall equation (15) above. The results 
are shown in Table 2. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 5000 


n = 10000 


n = 50000 


n = 100000 


/Viid(-F) 


1.04 


1.03 


1.02 


1.02 


1.02 


Mn,rev,l(-^') 


2.14 


6.25 


6.77 


8.26 


7.50 


Mn,rev,2(-P 1 ) 


2.79 


5.66 


6.58 


8.19 


7.54 




5.23 


9.12 


8.20 


8.25 


7.53 



Table 2: Estimated factors by which the variance of /it n (F) is larger than that of Mn,iid(-F), f*i n ,rcv,i(F) fJ, n ,Tev,2(F) 
and )i n (Fo'), respectively, after n = 1000, 5000, 10000, 50000 and 100000 simulation steps. 

Clearly, both adaptive estimators fi n ,rev,i(F), A*n,rev,2 (F) perform very well, and their results 
are reasonably close to those of the optimal estimator /j, n (Fg*). A natural way to attempt to 
obtain an even greater improvement in terms of variance reduction would be to consider a control 
variate U based on a G of the form G(x, y) = ax + by, for coefficients a ^ b. But there is no 
obvious choice for the relationship between these two coefficients, and also we do not want to 
develop methods that are too model-specific. A generic way to address such problems is to 
consider two control variates U±, U2, based on the two different functions G\(x,y) = x and 
G2(x,y) = y. The corresponding methodology for such cases is developed in Section 5, where 
we also revisit this example. 

Another well-known difficulty with the standard estimates in this example (in addition to 
their high variance) is that they converge very slowly when the initial values of the Gibbs sampler 
are far from their mean. The above results are from simulations with x$ = yo = 0.1, and we 
also run several examples with different initial values. In those cases we found that a lot of the 
time the estimators /J- n ,rev,i(F) and n n ,rev,2{F) not only reduced the variance, but also greatly 
improved the bias. An example with initial values xq = 4 and yo = 12 is shown in Figure 3. 
A more detailed discussion of this issue will be given in Section 4, where we also address the 
question of choosing between the two adaptive estimators, (i n ,rev,i(F) and fJ> n ,rcv,2(F). 

3 Five Simple MCMC Examples 

Below we present five examples more closely motivated by problems in statistical inference. 
Among the vast array of simple MCMC samplers that can be used for illustration purposes, 
we have chosen a set of representative examples that cover a broad class of real applications. 
The Gaussian-Gamma posterior in Example 3, as well as the the bivariate Gaussian density of 
Example 2, are representative of the large class of normal hierarchical models that are analyzed 
in Gelfand et al. (1990). Similarly, the Gibbs sampler of Example 4 is seen as a simplistic version 
of a wide class of models that include discrete variables as latent variable indicators or model 
indices. The discrete state space random-walk Metropolis algorithm in Example 5 is used in 
model search algorithms in which analytical or approximate integration of all model parameters 
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200 400 600 800 1000 1200 1400 1600 1800 2000 

Figure 3: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
Mn,rcv,i(i ? ), plotted only every 50 simulation steps, are shown as bold magenta dots, and the adaptive estimates 
Mn,rcv,2(i ? ) as a dashed cyan line. 

is first performed; see for example Clyde and George (2004). A simple version of a finite-mixtures 
mode of Normals is explored in detail in Example 6. This this class of models has been, and 
still is, one of the most challenging inference problems. Finally, we illustrate our methodology 
in the case of a "difficult" model where Cauchy priors result in heavy-tailed posterior densities; 
such densities are commonly met in, for example, spatial statistics; see Dellaportas and Roberts 
(2003) for an illustrative example. 

Example 3. A Gaussian-Gamma posterior. First we consider an example of the random- 
scan Gibbs sampler applied to simple Bayesian inference problem. The model is a simple two- 
parameter example of Gilks (1986), in which we begin with observations x = (xi,X2, ■ ■ ■ ,xn) 
that are independently generated from a N(fi, 7 _1 ) distribution, and we place priors \x ~ N(0, 1) 
and 7 ~ Gamma(2, 1) on the parameters fi and 7, respectively. [Throughout the paper, the para- 
matrization of the Gamma(a, b) density is chosen so that it has mean a/b.} It is straightforward 
that Gibbs sampling from the posterior 7r(/x,7|x) proceeds by updating [i given 7 from a Normal 
density with mean (7 ^\ Xi)/(1 + A7) and variance 1/(1 + A7), and 7 given \x from a Gamma 
density with index 2+ N/2 and scale 1 + | ^2ii x i~ f 1 ) 2 ■ I n ° ur simulation, we assume that N = 10 
and that the data vector x = (xj) is given by x = (—23, 27, 12, 17, —8, 2, —18, 17, 7, —33), so that 
the sample mean is zero. We wish to estimate the posterior mean of fi, so we let F(/j,, 7) = fj,. 
Although in general this posterior mean is not computable in closed form, here the posterior 
marginal density of \i is proportional to the product of a Student's t density with mean zero 
(because the sample mean of x is zero) and the prior N(0, 1) density. Therefore, the resulting 
density is symmetric around zero, which implies that the posterior mean of [i is actually zero. We 
compare the performance of the simple empirical averages [i n {F) with the adaptive estimators 
/J>n,Tev,i(F) an d /J- n ,rev,2{F), based on the control variate U = G — PG with G(fi, 7) = /i. 

Figure 4 shows a typical random-scan Gibbs sampling run of length n = 5000, with starting 
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Figure 4: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
Mn,rcv,i(^ ? ) as bold magenta dots; and the adaptive estimates fJ, n ,iev,2(F) as a cyan dashed line. For visual clarity, 
the values fJ, n ,r<sv,i(F) are plotted only every 100 simulation steps. 



values /Uq = 70 = 1- It is obvious from the plot that both adaptive estimators converge incredibly 
fast compared to the standard ergodic averages fj, n (F). The corresponding variance reduction 
factors, computed from T = 100 repetitions of the same experiment, are shown in Table 3. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 5000 


n = 10000 


n = 50000 


l^n,Tev,l \F) 


9403 


341095 


419766 


20453186 




713 


1880 


5287 


15495 



Table 3: Estimated factors by which the variance of fJ. n (F) is larger than that of fi n ,rcv,i(F) and fj, n , rev, 2(F), after 
n = 1000, 5000, 10000, and 50000 simulation steps. 

Given the tremendous effectiveness of the control variate U = G — PG with G(/u, 7) = /i, 
it is natural to ask if perhaps a multiple of G actually solves the Poisson equation, that is, if 
9*{G — PG) = F — tt(F). Since tt(F) = and in the simulation experiments both 9 n ^ e v,i and 
#n,rev,2 apparently converge to values very close to 2, we examine the relationship, 2(G—PG) = /i. 
Substituting the expressions for G and PG this becomes, 

which, in our case, is indeed an equality, since the empirical mean of our sample x is equal to 
zero. More generally, this will be an approximate equality (at least for most of the relevant 
values of \i and 7), as long the empirical mean of the sample is close to zero, or if most of the 
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mass of the posterior on 7 is concentrated near zero. In fact, a multiple of G(n,^) = \i will 
always solve the Poisson equation exactly, as long as the mean of the Gaussian prior on fi instead 
of zero is taken to be equal to the sample mean of x. 

The above discussion explains the effectiveness of the control variate U = G — PG with 
G(/j,, 7) = fi, but it also suggests that if the number of observations iV is small, and either: (a) the 
sample mean of the observations x is not close to zero; or (6) the empirical standard deviation of 
x is not appropriately "small" (in other words, the posterior on 7 is not concentrated near zero); 
then this G would not be an approximate solution to the Poisson equation, and the corresponding 
control variate U would be much less effective. Nevertheless, even in the unlikely scenario where 
the observation vector is x' = (rcj) = (4.75, 5.09, 4.63, 4.73, 5.08, 4.47, 5.24, 5.06, 4.98, 5.21), using 
the same control variate as before is quite effective; see the corresponding results in Table 4. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 5000 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


^n,rcv,l(-F) 


0.02 


0.01 


0.00 


0.44 


3.43 


3.87 


Hn,rcv,2(F) 


0.37 


4.46 


4.17 


8.81 


7.88 


6.50 



Table 4: Estimated factors by which the variance of fJ, n (F) is larger than that of n n ,rev,i(F) and /in, rev, 2 (J 1 ), after 
n = 1000, 5000, 10000, 50000, 100000 and 200000 simulation steps. Here the vector x of observations has sample 
mean 4.924 and sample standard deviation « 0.068. 

The reason this scenario is referred to as being "unlikely" is because the set of observations x' 
was actually obtained as an i.i.d. sample (rounded off to two decimal places) from the N(5, 0.09) 
density; it has sample mean equal to 4.924, and sample variance ~ 0.068. Therefore, having a 
iV(0, 1) prior on the mean of the observations x' that actually vary between 4.47 and 5.24 is 
an unreasonable choice. Also note that both of the potential sources of concern (a), (b) above 
apply here. Indeed, for 7 = 1/0.068, the right-hand-side of (18) is ~ 4.9, which is certainly not 
close to zero. Still, using the control variate U = G — PG with G(fi, 7) = /x consistently yields 
nontrivial variance reduction factors. 

Example 4. An example with a discrete variable. Next we construct a bivariate density 
with a discrete variable z and a continuous variable p, where z\p ~ Bern(p) and p ~ Beta(a, (3). 
The random-scan Gibbs sampler draws randomly from either z\p ~ Bern(p) or from p\z ~ 
Beta(a + z, /3 + 1 — z). 

We wish to estimate the mean of z, so we set F(z,p) = z and examine the performance of 
the ergodic averages n n (F) and the two adaptive estimators fJ- n ,re\,i(F) and fi n ,rcv,2(F) based 
on the control variate U = G — PG; for G we take, as before, G(z,p) = z + p. Figure 5 
depicts a typical realization of the random-scan Gibbs sampler, with a = 2, /3 = 1, starting 
values zo = po = 1/2, and n = 5000 steps. Here, the true value of tt(z) is aj(a + /?) = 2/3. 
The corresponding variance reduction factors, estimated from T = 100 repetitions of the same 
experiment, are shown in Table 5. 

Like in Example 3, since the use of the control variate U decreases the MCMC variance 
dramatically, it is natural to check if perhaps a multiple of G solves the Poisson equation. 
Direct computation gives, 

a + (a + (3 + 2)z 



PG(z,p)=p + 



2(a + /3 + l) 
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Figure 5: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
Mn,rcv,i(^ ? ) as bold magenta dots; and the adaptive estimates fJ, n ,iev,2(F) as a cyan dashed line. For visual clarity, 
the values fJ, n ,rev,i(F) are plotted only every 200 simulation steps. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 5000 


n = 10000 


n = 20000 


n = 50000 


n = 100000 




5.89 


24.50 


41.40 


212.8 


702.5 


1721.3 


fJ"n,rcv,2(F) 


247.4 


1286.5 


2145.8 


4235.4 


12066 


24777 



Table 5: Estimated factors by which the variance of l-tn(F) is larger than that of fi n ,rav,i (F) and j-i n ,rcv,2(F), after 
n = 1000, 5000, 10000, 20000, 50000 and 100000 simulation steps. 



so that, 

PG(z,p)-G(z,p) 



a + /3 



2(a + P + l) 



z + 



a 



a + /3 



a + /3J 2(a + /3 + l) 



F(z,p) + 7r(F) 



Indeed, then, G is a multiple of the solution of the Poisson equation for F, 



a + /3 

and the optimal coefficient for U is, 



2(q + /3 + l) 
a + ft 



This explains the effectiveness of this particular choice of the function G. Incidently, it is 
somewhat remarkable that a multiple of the same function G(z,p) = z +p solves the Poisson 
equation for any choice of the parameter values a, f3. 



18 



Example 5. Random-walk Metropolis for Poisson generation. Consider the target 
distribution tt ~ Poisson(A). When the mean A is large, it is hard to sample from tt directly 
and, instead, we consider a random-walk Metropolis sampler, which, given X n = x, proposes a 
move to X' n+1 = x + Z n , where the increments Z n are i.i.d. and Z n = — 1 or +1 with probability 
1/2 each. The acceptance probability can be easily computed as, 

fmin{l,^n-} ify = x + l, 
I min{l, |} \ly = x—\. 

Suppose we wish to estimate the mean of ^fx under tt, so let F(x) = ^fx. To use a control 
variate U = G — PG with respect to some function G on the integers, note that PG is, 

PG(x) = G(x) + -a(x,x + l)[G(x + 1) - G(x)} + ^a(x,x- l)[G(x - 1) - G(x)], 

so that, in particular, taking G(x) = x, we have, 

U (x) = ^a(x, x — 1) — ^a(x, x + 1). 

Figure 6 shows a typical realization of the Metropolis sampler, using G(x) = x, with initial 
value xo = 95, for n = 10000 simulation steps. The "true" mean of \[X under tt is estimated 
to be ~ 9.9874, after 3 million Metropolis steps. The corresponding variance reduction factors, 
estimated from T = 100 repetitions of the same experiment, are shown in Table 6. 



10.15^ — 
10.10- 
10.05- 
10.00- 

9.95- 

9.90- 

9.85- 

9.80- 

9.75- 



1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 

Figure 6: The sequence of the standard ergodic averages is shown as a solid blue line and the adaptive estimates 
/^n, rev, 2 (J 1 ) as a dashed cyan line. The adaptive estimates are plotted only every 200 simulation steps. 

Example 6. A two-parameter Gaussian mixture posterior. We examine a sim- 
ple Gaussian mixture example as in Robert and Casella (2004, Example 9.2). Suppose x = 
(xi,X2, ■ ■ ■ , xn) are independent observations from the mixture pN(^i, a 2 ) + (1 — p)N(fj,2,cr 2 ); 
the mixing proportion p and the variance a 2 are assumed fixed and known, and iV(0, lOcx 2 ) 
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Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 10000 


n = 50000 


n = 100000 


Mn,rev,l(-^') 


0.91 


0.32 


0.15 


0.20 




4.73 


39.19 


157.5 


239.98 



Table 6: Estimated factors by which the variance of fJ-n(F) is larger than the corresponding variance of /i„, rov ,i(^ ? ) 
and Hn,iev,2(F), respectively, after n = 1000, 10000, 50000 and 100000 simulation steps. 



priors are placed on the means fii,fi2- To facilitate sampling from the posterior, the model can 
alternatively be described in terms of latent variables Z = (Z±, Z 2 , . . . , Zn), where the Zj are 
independent with distribution P(Zi = 1) = 1 — P(Zi = 0) = p, and, conditional on fii,fi2 and 
Z, each Xi\Z { = 1 ~ N(fn,a 2 ), and Xi\Z t = ~ N(fi 2 ,a 2 ). 

Conditional on x and z, the parameters fii and fi2 are independent, with, 



7r(//i|x, z) ~ A" 



m + 1/10' m + l/io 

J2j( l ~ z j)xj a 2 



n(fi2\x,z) ~ jV J J , — — , (19) 

V 712 + 1/10 n 2 + l/10/ 

respectively, where rii = ^ ■ ^ is the number of Zi that are equal to 1, and ri2 = YljO- ~ z j) = 
N—rii is the number of Zi that are equal to zero. Also, given fii, fi2 and x, the Zi are independent, 
and for each i = 1, 2, . . . , N, 

,7 _-,\ \_ pexp{-(x» - ^i) 2 /2a 2 } 

^ " " 9i - pexp{-(x J - m ) 2 /2a 2 } + (1 -p)exp{-(x, - ^Ar 2 }' ^ 

The random-scan Gibbs sampler here draws a sample from fii, (12, or from the entire vector Z, 
each chosen with probability 1/3. 

We consider an example with the exact same parameter settings as in Robert and Casella 
(2004, Example 9.2): With p = 0.7, a 2 = 1, m = and fj, 2 = 2.7, we generated N = 500 samples 
from the mixture pN(fii,a 2 ) + (l — p)N(/j,2, a 2 ). In order to estimate /xi, the function F is set to 
be F(hi,/j,2, Z) = fj,i. Using a control variate U = G — PG with the simplest choice of G = F, 
yields variance reduction factors around 4, which are significantly smaller than those achieved 
in some of the earlier examples. For that reason, we also consider G(fii,fi 2 , Z) = ^ ZiXi as a 
different candidate G for the control variate U. In fact, we let G = cfi + Y^i Z^i, and select the 
value of c so that a multiple of G is as close as possible to a solution of the Poisson equation, 
that is, 9{PG — G) ~ — F + n(F), for some 9; substituting the values of F, G and PG, and 
taking tt(F) to be equal to the prior expectation of F (namely, zero), this becomes, 



Z~2 X * [ Zi { 



1 



m + 1/10 



z-cW (21) 



where the (qi) are the Bernoulli parameters of the (Zi), given in (20). A reasonable goal here 
is to choose c so as to reduce the variability of the left-hand-side as much as possible, since 
it is not directly related to fi\. Ideally, this would mean taking c = ni + 1/10, but since ni 
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is itself random, we take c to be equal to the (prior) expectation of that expression, namely, 
c = (Np + 1/10), so that the resulting function G is, 

G(/xi,/x 2 , Z) = (Np + l/10)/ii + z i*i- 

i 

A typical realization of the estimates based on n = 10000 Gibbs steps is shown in Figure 7, 
and the corresponding variance reduction factors are displayed in Table 7. The initial values 
are //i = 0, //2 = 1> and the "true" posterior mean of p\ is estimated to be « —0.0143, after 10 
million Gibbs steps. 
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Figure 7: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
fin, rev, i(F) as bold magenta dots; and the adaptive estimates Hn,tev,2(F) as a cyan dashed line. For visual clarity, 
the values fJ, n ,rev,i(F) are plotted only every 200 simulation steps. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


Mi, rev, 1 (-^0 


11.76 


15.81 


19.02 


22.12 


fJ-n,rcv,2{F) 


11.63 


15.44 


18.98 


21.98 



Table 7: Estimated factors by which the variance of (J, n {F) is larger than the corresponding variances of (i n ,iev,i{F) 
and (j, n , re v, 2 {F), respectively, after n — 10000, 50000, 100000 and 200000 simulation steps. 

Incidently, the above calculation suggests that the optimal value for 9 here would be the one 
that also makes the right-hand-side of (21) vanish, namely, 9* ~ 3/c = 3/(Np + 1/10) « 0.009. 
In our simulation experiments, the estimates of 9* produced by both n ,rev,i and #n, r ev,2 are 
around 0.011, which is indeed quite close. 
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Finally we note that models of this type often present a difficultly, in that the posterior on 
(/ii,/X2) is bimodal. As a result, if the Gibbs sampler is initialized near the lower mode, it will 
never visit the neighborhood of the actual mode, at least not in any reasonable amount of time; 
see Robert and Casella (2004); Diebolt and Robert (1994). A more general Gaussian mixture 
model that at least partially addresses this issue is explored in Section 6.2. 

Example 7. A Metropolis-within-Gibbs sampler. We consider an inference problem 
motivated by a simplified version of an example in Roberts and Rosenthal (2006). Suppose 
N i.i.d. observations x = (x\, X2, ■ ■ ■ , xn) are drawn from a N(4>,V) distribution, and place 
independent priors <fi ~ Cauchy(0, 1) and V ~ IG(1, 1), on the parameters (f>,V, respectively. 
The induced full conditionals of the posterior are easily seen to satisfy, 

ir((f)\V,x) oc (yTj?) 6XP { ~ 2^^^" Xi)2 }' 

i 

and ir(V\<f>,x) ~ IG(l + y , 1 + \ J> ~ ^f) ■ 

i 

Since the distribution tt(4>\V,x) is not of standard form, direct Gibbs sampling is not possible. 
Instead, we use a random-scan Metropolis-within-Gibbs sampler, Miiller (1993); Tierney (1994), 
and either update V from its conditional (Gibbs step), or update 4> in a random walk-Metropolis 
step with a <j>' ~ N((j>, 1) proposal, each case chosen with probability 1/2. Since both the Cauchy 
and the inverse Gamma distributions are heavy tailed, we naturally expect that the MCMC 
samples will be highly variable. Indeed, this was found to be the case in the simulation example 
we consider next, where the above algorithm is applied to a vector x of N = 100 i.i.d. A(2,4) 
observations, and with initial values (fio = and Vq = 1. As a result of this variability, the 
standard empirical averages of the values of the two parameters also converge very slowly. Since 
V is the more variable of the two, we let F(4>, V) = V and consider the problem of estimating 
its posterior mean. We will compare the performance of the standard empirical averages n n {F) 
with that of the two adaptive estimators fi n ,rev,i(F) and fJ- n ,rcv,2(F), with the control variate 
U = G — PG defined in terms of the function G = F. Note that, in this setting, we are 
restricted in our choice of functions G to those which depend only on V. Since 4> is updated by 
an accept/reject Metropolis step, if G depended on <\> it would not be possible to compute the 
required one-step expectation PG in closed form. 

Figure 8 shows a typical realization of the results of the three estimators, for n = 10000 
simulation steps. The "true" posterior mean of V is estimated to be « 4.254 after 10 million 
simulation steps, and the corresponding variance reduction factors, estimated from T = 100 
repetitions of the same experiment, are shown in Table 8. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


Mn,rev,l (F) 


2.58 


7.62 


9.34 


8.13 


fJ-n,rcv,2{F) 


7.89 


7.48 


10.46 


8.54 



Table 8: Estimated factors by which the variance of Hn{F) is larger than the corresponding variances of n n ,icv,i(F) 
and Hn, rev, 2(F), respectively, after n = 10000, 50000, 100000 and 200000 simulation steps. 
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Figure 8: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
Mn,rcv,i(^ ? ) as bold magenta dots; and the adaptive estimates fJ, n ,iev,2(F) as a cyan dashed line. For visual clarity, 
the values fJ, n ,r<sv,i(F) are plotted only every 200 simulation steps. 

4 Variance, Bias and Choosing Between ^ n re v,i and /i n r ev,2 

We examine how the use of the adaptive estimators estimators n n ,rev,i(F) and fi n ,rev,2(F) can 
affect the estimation bias, especially in cases where the initial values of the sampler are far 
from the true mean of the target distribution. Also we briefly discuss the different advantages 
and disadvantages offered by each of these two estimators, and conclude that, generally, the 
preferable choice is (i n ,rev,2{F). 

4.1 Estimation bias 

The primary focus of the present work is on variance reduction, more specifically, on reducing 
the asymptotic variance a 2 F of the estimators fi n (F). This variance is a "steady-state" object, 
in that it characterizes the long-term behavior of the averages fJ> n (F) and depends neither on 
the initial condition Xq = x nor on the transient behavior of the chain. The bias, on the other 
hand, depends heavily on the initial condition, and vanishes asymptotically. Indeed, according 
to the expression in (2) for the solution F of the Poisson equation (3), 



which decays to zero approximately like F(x)/n, asn-> oo. 

If instead of the standard ergodic averages fJ> n (F) we use an estimator of the form Hn(Fo) 
based on a control variate U = G — PG for some function G, then, replacing F with Fq in the 
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above computation shows that the bias of p n (Fg) is, 

bias x (fi n (F e )) := E x [fi n (F e )] - tt(F) = ±-[F e (x) + o(l)] = ±[F(x) - 6G{x) + o(l)], (23) 

where we used the fact that Fg = F — OG, as shown in (8). Therefore, the function G that 
minimizes (to first order) the bias of the estimates fi n (Fg) is again the solution of the Poisson 
equation G = F. Of course this can also be seen directly from the definition of Fg: As in (7), 
if G = F and = 1, then, F — OU = ir(F), leading to an estimator with zero bias and zero 
variance. 

As noted earlier, the solution F to the Poisson equation cannot be computed in the vast 
majority of realistic examples of nontrivial Markov chains appearing in applications, if for no 
other reason, because it requires knowledge of the mean ir(F). Instead, a more pragmatic goal 
is to try and choose a "good" value for the parameter 0, so that the resulting estimator p n (Fg) 
has significantly smaller bias than p n (F). Unlike the variance, the bias depends heavily on the 
initial condition x, so there is no obvious choice that makes 0G(x) "close" to F(x) for all x. In 
fact, for good variance reduction results we wish to have G(x) « F(x) for most values x near 
the bulk of the target distribution it, whereas for the bias we need to have G(x) ~ F(x) at the 
initial value x of the chain, which may well be out in the tail of it. Nevertheless, it may be 
natural to expect that taking 0^0* could be a good general substitute. Although 0* does not 
eliminate the bias entirely, it does bring OG "as close as possible" to F, where "closeness" here 
is measured in the sense of minimizing a\ . 

In order to examine whether the choice ~ 0* does indeed offer an advantage in terms of 
the bias, we revisit Example 2 from Section 2.4, where it was observed that in some cases the 
adaptive estimators fi n ,j-ev,i{F) and fi n ,rev,2(F) did offer a significant reduction in the bias. 

Example 2 revisited: Bias and MSE. As before, we use the random-scan Gibbs sampler 
to simulate from a bivariate Normal vector (X, Y) ~ tt(x, y), where the expected values of both 
X and Y are zero, Var(X) = 1, r 2 = Var(y) = 10, and their covariance E(XY) = pr with 
p = 0.99. To estimate the expected value of X under n, we let F{x, y) = x, G(x, y) = x + y and 
take the control variate U = G — PG. 

In Section 2.4 it was noted that, when the initial values of the sampler were relatively far 
from their mean (so that the samples where initially heavily biased), the adaptive estimators 
f^n,rev,i(F) and p. n ,rev,2(F) not only reduced the variance, but also appeared to be correcting 
for the estimation bias; see Figure 3. This agrees with the intuition obtained by the discussion 
following the computations in (22) and (23) above. In order to get a more precise idea of the 
effect of the use of the adaptive estimators p n ,rev,i(F) and p n ,rev,2(F) on the bias and the overall 
estimation error, we estimate the factors by which each of these two estimators improves (i) the 
bias; (u) the variance; and (in) the overall estimation mean-squared error (MSE). The results 
are shown in Tables 9 and 10; Table 9 shows simulation results for a sampler started from initial 
values near the true mean of the distribution, xq = yo = 0.1, and Table 10 shows corresponding 
results with initial values xo = 4, yo = 12. 

The bias E x [p n (F)\ — ir(F) of the standard estimators p n {F) was computed from T = 200 
independent repetitions of the same experiment, in a way similar to that used for the variance in 
the earlier examples; see the discussion in Example 1. Specifically, for p n (F), T = 200 different 
estimates fJ$(F), for i = 1, 2, . . . , T, were obtained from T = 200 independent runs of the Gibbs 
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Example 2: xo = yo = 0.1 


Bias reduction factors 


Estimator 


n = 1000 


n = 10000 


n = 20000 


n = 50000 


n = 100000 


Mn,rev,l C^) 


0.80 


2.06 


1.27 


1.65 


1.00 




0.83 


1.31 


1.02 


1.58 


0.75 


Variance reduction factors 


Mro,rev,l (-^) 


2.46 


7.27 


8.06 


8.77 


9.60 


fJ-n,rcv,2{F) 


3.51 


6.34 


6.62 


8.15 


9.33 


MSE reduction factors 


fl"n,rcv,l{F) 


2.45 


7.26 


8.04 


8.64 


9.54 


fJ-n,rcv,2(F) 


3.46 


6.29 


6.59 


8.03 


9.29 



Table 9: Estimated factors by which the bias, variance, and MSE of Hn(F) is larger than that of fi n , rev, i(F) and 
^n,rcv,2(i ? ), respectively, after n = 1000, 10000, 20000, 50000 and 100000 simulation steps. 



Example 2: x$ = 4, yo = 12 


Bias reduction factors 


Estimator 


n = 1000 


n = 10000 


n = 20000 


n = 50000 


n = 100000 


Mn,rev,l {F) 


1.95 


3.66 


4.45 


7.97 


7.97 


Mn,rev,2 {F) 


6.88 


7.39 


8.35 


13.52 


9.22 


Variance reduction factors 


l^n,rev,l{F^ 


1.59 


8.04 


9.01 


8.66 


9.08 


fJ-n,rcv,2(F) 


0.89 


7.00 


7.91 


8.72 


8.72 


MSE reduction factors 


fl"n,rcv,l{F) 


2.57 


8.34 


9.75 


9.11 


9.34 


fJ-n,rcv,2(F) 


2.57 


7.60 


9.01 


9.22 


8.98 



Table 10: Estimated factors by which the bias, variance, and MSE of (i n (F) is larger than that of fi n ,rcv,i(F) and 
Mn,rcv,2(i ? ), respectively, after n = 1000, 10000, 20000, 50000 and 100000 simulation steps. 

sampler. Then the bias of fi n (F) was estimated by, 

fin{F)-*{F) := ^X>«(F)-7r(F), (24) 
i=i 

and the same procedure was applied to estimate the bias of fi n ,rev,i(F) and /J- n ,rcv,2(F). The bias 
reduction factors shown in the two tables are the ratios of the corresponding (absolute values 
of the) bias estimates. The variance reduction factors were computed as before, and the MSE 
reduction factors were computed in an analogous manner. 

In both cases, the results clearly show that both estimators /J, n ,mv,i(F) and ^ n ,rev,2{F) greatly 
reduce the estimation error, not only in terms of their asymptotic variance, but in terms of the 
bias and of the overall estimation error as well. 
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4.2 Choosing between the two estimators 

In the simulation examples presented so far as well as in many more experiments, we observed 
that the overall performance of the two estimators is fairly similar. One difference is that, in 
cases where the initial values of the sampler were very far from the bulk of the mass of the 
target distribution ir, sometimes 6 n ,rev,i converged faster than n)rev ,2 and the corresponding 
estimator n n ,rev,i(F) gave better results than n n ,rcv,2(P)- The reason for this discrepancy is the 
existence of a the time-lag in the definition of # n ,rev,2 : When the initial simulation phase produces 
samples that approach the area near the mode of the distribution approximately monotonically, 
the denominator of 6* n ,rev,2 accumulates a systematic one-sided error, and therefore takes longer 
to converge. But this is a transient phenomenon, and can be addressed (and often eliminated) 
by including a burn-in phase in the simulation. 

One the other hand, we observed that the estimator 6 n ^ cv ,2 was systematically more stable 
than $ n ,rev,i 5 especially in the more complex MCMC scenarios involving multiple control variates. 
This was particularly pronounced in cases where the denominator of 8* is near zero. There, 
because of the inevitable fluctuations in the estimation of this denominator, the values of n ,rev,i 
fluctuated wildly between large negative and positive values, whereas the estimates 6 n ,rev,2 were 
much more reliable since, by definition, the denominator of 9 n ,rev,2 is always nonnegative. 

In conclusion, we find that between /i n ,rev,i and /i n ,rcv,2, the estimator fi n ,rcv,2 is generally 
the more reliable, preferable choice. In all the examples that follow, we will restrict attention to 
this estimator; see also the comments at the end of Section 5.1. 

5 Using Multiple Control Variates Simultaneously 

5.1 Adaptive estimators with multiple control variates 

Starting from the same setting of a Markov chain {^n} with transition kernel P, invariant 
measure it, and a function F : X — > R whose mean under tt is to be estimated, suppose that, 
instead of using a single control variate U = G — PG, we wish to use multiple Uj = Gj — PGj , 
j = 1,2, ... ,k. One reason for such a choice is so that the optimal G = F may potentially be 
approximated as a linear combination of "basis functions" Gj, namely, F « YLjQjGj- 

Formally, let G : X — > R fe denote the column vector G = (G±, G2, ■ ■ ■ , Gfc)*, where each Gj is 
a given function G j : X — >■ R, and similarly write U = (U\, U2, ■ ■ ■ , C/fe)* for the column vector of 
control variates Uj = Gj — PGj. For any coefficient vector 6 = (#1, 62, ■ ■ ■ , 9kY G R fc , we write 
Fg = F — (0, U) and consider the corresponding modified estimator for vr(F), 

k 

Vn(F e ) = fin(F) - {0,Hn(U)) = fi n (F) - J^Oj^Uj). 

3=1 

[Here and throughout the paper all vectors are column vectors, and (•, •) denotes the usual Eu- 
clidean inner product.] Arguing exactly as in the one-dimensional case, the asymptotic variance 
of Fq can be expressed as, 

a% = 4 - 2<k(P(6,G) - PF(6,PG))+k((6,G) 2 - (6,PG) 2 ), (25) 
where, PG stands for the vector (PG\,PG2, ■ ■ ■ , PGk) 1 - 
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To find the optimal 9* , differentiate the quadratic a\ 6 with respect to each 9i and set the 
derivative equal to zero, to obtain, in matrix notation, 

T{G)9* = ir(FG - (PF)(PG)), 

where the k x k matrix T(G) has entries, T(G)ij = n(GiGj - (PGi)(PGj)). Therefore, 

0* = r(G)~ 1 7r(FG - (PF)(PG)), (26) 

as long as T(G) is invertible. Note that this expression is perfectly analogous to the one- 
dimensional formula for 9* in (9). Also, in view of equation (12) from Section 2.2, the entries of 
T(G) can be expressed as, 

oo 

r(G)ij = AGiG 3 - {PGi)(PGj)) = ir(UiGj - {PU^PG,)) = ^ Cov 7T (U i (X ),Gj(X n )). 

n=— oo 

This shows that T(G) has the structure of a covariance matrix and, in particular, it suggests 
that T(G) should be positive semidefinite. Indeed, the following lemma states that the entries 
of T(G) can be written in a way which makes both of these assertions obvious: 

Lemma 2. Let K(G) denote the covariance matrix of the random variables 

Yi := Gi{X{) - PGi{X ), i = l,2,...,fc, 
where X ~ tt. Then F(G) = K(G), that is, for all l<i,j<k, 

ir^Gj - (PGi)(PGj)) = E n [(g^) - PG,(X )) (g^) - PG,(X ))] . (27) 

Proof. Expanding the right-hand side of (27) we obtain, 

Tr(GiGj) - E^GiiXjPGjiXo)] - E^G^PG^Xq)] + n((PGi)(PGj)), 

and the result follows upon noting that the second and third terms above are both equal to the 
fourth. To see this, observe that the second term can be rewritten as, 

E^EfaxjPGjiXo) \X ]} = E^ElGiiXi) \ X }PG j (X )] = 7r((PG,)(PG j )), 

and similarly for the third term. □ 
Therefore, the optimal coefficient 9* can also be expressed as, 

9* = K{Gy l n{FG - (PF)(PG)). (28) 

Proceeding exactly as before, for a reversible chain, starting from the expressions for 9* in 
(26) and (28) we obtain: 

Proposition 2. If the chain {X n } is reversible, then the optimal coefficient vector 9* for the 
control variates Ui = Gi — PGi, i = 1, 2, . . . , k can be expressed as, 

9* = 9; cv := r(G)- 1 vr((F - vr(F))(G + PG)), (29) 
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or, alternatively, 

6* = e; cv := K(G)- 1 vr((F - tt(F))(G + PG)). (30) 

The proof of Proposition 2 is perfectly analogous to that of Proposition 1 in the case of a 
single control variate. 

As before, the expressions (29) and (30) suggest estimating 9* via, 

0n,r = T n (G)- 1 [f, n (F(G + PG))- t x n (F)p, n (G + PG)] 
or n , K = K n (G)- 1 [ /t / n (F(G + PG))- / u n (FK(G + PG)], 

where the k x k matrices T n (G) and K n (G) are defined, respectively, by, 

{r n (G))ij = VniGiGj) - ^((PGiXPGj)) 

1 n— 1 

and (K n (G))y = - ^(G;(X t ) - PG i (X t _ 1 ))(G J (X t ) - PG^X^)). 
n t=o 

The resulting estimators, p. n {Fr, ) and p, n {Fa ) for 7r(P) based on the vector of control variates 

U = G — PG and the coefficients # n) r and n ,K, respectively, are defined analogously to the 
single-control- variate case as, 

A*n,r(F) := /i n (F, ) = /x n (F) - (0 n ,K, M^)) ( 31 ) 
and Mn,K(P) := M^ n , r ) = Mn(P) - (4,K, A*n(^))- (32) 

Recall that in Section 4.2 we concluded that, for the case of a single control variate, the 
adaptive estimator /U njrcVi 2 was generally preferable to n n ,rcv,i- For the same reasons, and also 
based on the results of extensive simulation experiments with multiple control variates, we 
similarly conclude that n n ,K(F) is more reliable, more stable, and generally preferable to ^ n ^{F). 
Therefore, in all of our subsequent examples we restrict attention to the estimator n n ,K(F). 

5.2 Examples 

Here we re-examine two of the earlier examples, and illustrate how the use of multiple control 
variates can often provide a much greater improvement in estimation accuracy. 

Example 2 revisited. Let (X, Y) ~ 7r(x, y) be a zero mean, bivariate Normal distribution, 
with Var(X) = 1, Var(F) = r 2 , and E(XY) = pr for some p £ (-1,1). To estimate the 
expected value of X under ir we sample from tt using a random-scan Gibbs sampler and set 
F(x, y) = x. Instead of the single control variate U = G — PG based on G(x, y) = x + y, here 
we consider two control variates U±,U2 defined in terms of Gi(x,y) = x and G2(x,y) = y, 
respectively. We examine the performance of the adaptive estimator p n ,K{F), and compare it 
with the performance of obtained earlier by the single-control- variate estimator p, n ^ e v,2{F). 

Figure 9 depicts a typical realization of the sequence of estimates of the standard ergodic 
averages p, n (F), as well as the corresponding estimates obtained by p> n ,K(F), for n = 20000 
simulation steps. The parameter values are p = 0.99 and r 2 = 10, with initial values xq = 
yo = 0.1. Table 11 shows the corresponding variance reduction factors, estimated from T = 200 
repetitions of the same experiment. 
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Figure 9: The sequence of the standard ergodic averages is shown as a solid blue line; the adaptive estimates 
^n,icv,2{F) with respect to the single control variate U as a cyan dashed line, and the adaptive estimates n n ,K(F) 
with respect to the two control variates Ui, U2 as red diamonds. For visual clarity, the values fi n ,K(F) are plotted 
only every 350 simulation steps. 



Variance reduction factors 




Simulation steps 


Estimator 


n = 1000 


n = 10000 


n = 50000 


n = 100000 


n = 200000 


fJ"n,iev,2(F) 


2.89 


6.17 


8.17 


7.42 


9.96 




4.13 


27.91 


122.4 


262.5 


445.0 



Table 11: Estimated factors by which the variance of /-i n (F) is larger than the corresponding variances of 
Mn.rcv^F 1 ) and jU„, K (-F), respectively, after n = 1000, 10000, 50000, 100000 and 200000 simulation steps. 

Clearly the estimator based on the two control variates is extremely effective, and certainly 
significantly better than the one based on a single control variate. As in Example 4, this 
effectiveness is actually explained by the fact that the exact solution of the Poisson equation 
in this case is of the form F(x, y) = ax + by. Indeed, it is a simple matter to verify that, 
F(x,y) = T ^ I [x + £y\. 

Example 6 revisited. Recall the setting of the inference problem in Example 6 above, where, 
based on N = 500 independent observations x = (x\,X2, ■ ■ ■ ,xn) generated from the mixture 
pN((Xi, a 2 ) + (1 — p)N([i2,cr 2 ), we wish to estimate fix. The mixing proportion p = 0.7 and the 
variance a 2 = 1 are fixed and known, iV(0, 10cr 2 ) priors are placed on /ii,/i2> and each of the 
binary latent variables (Zj) equals 1 if the corresponding Xi is generated from the first component 
of the mixture, and Z{ = otherwise. We use the random-scan Gibbs sampler, based on the full 
conditionals of the posterior, given in (19) and (20). 

In Section 3, letting F(fii,f/,2, Z) = Mi an d using a control variate U = G—PG in terms of the 
function G{p,\, 1^2, Z) = (Np+1/1Q)(J,\ %i x ii we obtained variance reduction factors around 
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20. The natural next step is to repeat the same experiment, this time with two control variates 
U\,U% defined in terms of the functions Gi(/xi, [12, Z) = n\ and G^O^i, H2, Z) = ^2 i ZiX{. In 
numerous simulation experiments we observed that, using two control variates in this case offered 
no apparent performance improvement. This suggests that the ratio of the coefficients of the 
functions G2 and G±, which was earlier chosen as l/(iVp+0.1) based on a heuristic computation, 
must be near-optimal. Indeed, after two million Gibbs steps, the estimated value of the optimal 
parameter vector 6* for the two control variates U\,U2 was (3.832,0.0129). The resulting 
optimal ratio 0.0123/3.832 « 0.0034 is, as expected, quite close to l/(Np + 0.1) « 0.0029. 

Next we consider using four control variates, defined in terms of the functions G±, G2 above 
together with Gs(fii, H2-, Z) = ^2 and G\{fi\, fi2, Z) = n\. In this case, the corresponding 
variance reduction factors, estimated from T = 100 repetitions of the same experiment (with 
initial values for the sampler fi\ = 0, ji2 = l)j are 97.47, 138.39, 91.84 and 103, after 
n = 10000, 50000, 100000 and 200000 simulation steps, respectively. Compared to the earlier 
results (variance reduction factors around 20), these results clearly demonstrate the significant 
improvement in estimation accuracy due to the simultaneous use of multiple control variates. 

6 Four More Complex MCMC Examples 

This section illustrates our proposed methodology applied to a series of real Bayesian inference 
problems, providing guidelines on how functions G can be chosen for the construction of effective 
control variates. The first example is a binary probit model, an early success of MCMC inference 
through data augmentation; see Albert and Chib (1993). The second example is a simple 
finite mixture of normals, another early application of data augmentation via Gibbs sampling; 
see Diebolt and Robert (1994). What makes this problem particularly interesting is the fact 
that, although we impose an a priori restriction on the ordering of means, the control variates 
methodology can still be applied after a first phase of unrestricted MCMC sampling, and after the 
sample has been ordered at the post-processing stage. If the objective is to estimate the means, 
the calculation of effective control variates U is still possible, despite the fact that the resulting 
Markov chain has a particularly complex structure. The third example is of a Bayesian model- 
determination problem, in which model searching is achieved by a discrete Metropolis algorithm 
on the space of candidate models. Such applications have recently found tremendous interest, 
especially in the context of genetics (see, e.g., Bottolo and Richardson (2008)) where the model 
space is endowed with a multimodal discrete density. Finally, in the case of a simple log-linear 
model we show that, even when we are forced to consider functions G that are very different 
from F, the resulting control variates U can still be very effective in terms of variance reduction. 

6.1 A binary probit example 

Probit models are a well-known and commonly used class of discrete regression models; see, for 
example, the monograph by Johnson and Albert (1999) and the references therein. Here we 
illustrate the use of the control variate methodology when a random-scan Gibbs sampler is used 
for Bayesian inference from the posterior of a binary probit model. 

Specifically, we begin with an N x k matrix x = (x\ , x 2 , ■ ■ ■ , x^Y of known covariates, where 
each X{ is a column vector in M. k . We also have and a vector Y = (Y"i, Y 2 , . . . , Ijv) of known 
binary responses Yj, where we assume that the Y{ have, 

Pi := Pr{Yi = 1} = 1 - Pr{Y t = 0} = *(x\p), i = 1, 2, . . . , N, 
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and the unknown parameter vector (3 £ R fc is to be estimated. To facilitate sampling from 
the posterior of (3, Albert and Chib (1993) introduce independent latent random variables Z = 
(Zi, Z 2 , . . . , ZnY, where each Zi ~ N{x\l3, 1). In other words Z = xj3 + e, where e ~ iV(0, /) is 
independent noise. Then the Y\ can be expressed, Yi = I{^. >0 }, so that, again, pi = <&(x\[3). 

If we place a diffuse prior on (3, then n(f3\x,Y, Z) ~ N ((x* x)^ 1 (x 1 Z) , (x*a;) _1 ), and the Z; L 
are conditionally independent given x, Y, f3, with, 

ir(Zi\x, Y, (3) ~ N{x\P, 1) conditional on Zj > 0, if Yi = 1; 
7r{Zi\x, Y, (3) ~ iV(x*/3, 1) conditional on Z< < 0, if Y; = 0. 

We consider a specific example using the "statistics class" data from Johnson and Albert 
(1999, p. 77). In this case, for N = 30 students, each Yi is the indicator of weather student i 
passed or failed in a statistics class. There are k = 2 covariates (xn, x^) for each student, where 
xn = 1 for all i and x,-a is the ith student's SAT Math test score. We place a diffuse prior on the 
coefficient vector (3 = (/3o,/?i), and we consider the problem of estimating the posterior mean of 
(3\. (The parameter f3\ is chosen as the more interesting of the two; the results are very similar 
for the case of /?o-) To that end, we let F(j3, Z) = (3±, and we also consider a vector U = G — PG 
of control variates based on five-component function G, 

G(P,Z) = ({3,(x t x)- 1 x t Z,[3 2 i y. 

For the initial condition of ft in the sampler we took its least-squares estimate j3 := (x t x)~ 1 (x t y), 
and for Z we simply drew a sample from its full conditional density as above. The choice of the 
function G for the construction of control variates is pretty self-evident: The variables /3o, (3i and 
f3f should obviously be strongly correlated with the target variable (3\, and the vector (x t x)~ l x t Z 
is included in an attempt to minimize the effect of the mean of (3 under its full conditional. 

The result of a typical realization of the random-scan Gibbs sampler after 15000 iterations 
is shown in Figure 10. The horizontal line shows the "true" value of ir(F) 0.03759, the result 
of Hn(F) after 10 million Gibbs iterations. The variance reduction factors obtained by ^, n ^{F), 
estimated after T = 100 repetitions of this experiment, are 5.08, 34.22, 53.54, 88.37 and 
69.72, after n = 1000, 10000, 50000, 100000 and 200000 iterations, respectively. 

6.2 Gaussian mixtures 

Mixtures of densities provide a versatile class of statistical models that have received a lot of 
attention from both a theoretical and a practical perspective, for many decades now. Mixtures 
primarily serve as a means of modelling heterogeneity for classification and discrimination, and 
as a way of formulating flexible models for density estimation. Although one of the fist major 
success stories in the MCMC community was the Bayesian implementation of the finite Gaussian 
mixtures problem, Tanner and Wong (1987); Diebolt and Robert (1994), there are still numerous 
unresolved issues in inference for finite mixtures, as discussed, for example, in the recent review 
paper by Marin et al. (2005). These difficulties emanate primarily from the fact that such models 
are often ill-posed or non-identifiable. In terms of Bayesian inference via MCMC, these issues 
reflect important problems in prior specifications and label switching. In particular, improper 
priors are hard to use and proper mixing over all posterior modes requires enforcing label- 
switching moves through Metropolis steps. Detailed discussions of the dangers emerging from 
prior specifications and identifiability constraints can be found in Marin et al. (2005); Lee et al. 
(2008); Jasra et al. (2005). 
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Figure 10: Probit: The sequence of the standard ergodic averages is shown as a solid blue line and the adaptive 
estimates (J, n ,K{F) as red diamonds. The adaptive estimates are plotted only every 200 simulation steps. 

Here we generalize the estimation setting of Example 6 above, by employing a more real- 
istic two-component Gaussian mixtures model as follows. Starting with N = 500 data points 
x = (xi,X2, ■ ■ ■ ,xn) generated from the mixture distribution 0.7iV(0, 0.5 2 ) + 0.3iV(0.i, 3 2 ), and 
assuming that the means, variances and mixing proportions are all unknown, we consider the 
problem of estimating the two means. The usual Bayesian formulation enriches that of Exam- 
ple 6 by introducing parameters (p\, \xi, 0i, cr 2) Z,p), as follows. The data are assumed to be 
i.i.d. from pN(ni,a 2 ) + (1 — p)N(/j,2, cr 2 ), an d we place the following priors: p ~ Dirichlet(<5, S), 
the two means Hi-,^2 are independent with each fj,j ~ iV^,/?" 1 ), and similarly the variances 
are independent with each aj 2 ~ Gamma(a,/3). We adopt the vague, data-dependent prior 
structure of Richardson and Green (1997): We set 5 = 1, we let £ equal to the empirical mean 
of the data x, k~ 1 / 2 is taken to be equal to the data range, a = 2 and (3 = 0.02re _1 . As be- 
fore, conditional on the parameters (pi, fi2, o~i, &2}P), the latent variables Z = (Z\, Z2, . . . , Zjv) 
are i.i.d. with Pr{Zj = 1} = 1 — Pr{Zj = 2} = p, and, given the entire parameter vector 
(Ml) M2> °"i> °2> Z,p), the data x = (xj) are i.i.d. with each xi having distribution N(fij,a 2 ) if 
Zi = j, for» = l,2,...,JV, j = l,2. 

In order to estimate the mean vector (^1,^2) we sample from the posterior via a standard 
random-scan Gibbs sampler, and we also introduce the a priori restriction that fi± < fj,2- In 
terms of the sampling itself, as noted by Stephens (1997), it is preferable to first obtain draws 
from the unconstrained posterior distribution and then to impose the identifiability constraint 
at the post-processing stage. In each iteration, the random-scan Gibbs sampler selects one of 
the four parameter blocks (m,^), (ci,^), Z or p, each with probability 1/4, and draws a 
sample from the corresponding full conditional density. These densities are all of standard form 
and easy to sample from; see, for example, Richardson and Green (1997). In particular, the two 
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means are conditionally independent with each, 



y 0^ rij + At 0^ Tlj + Kj 

where nj = #{i : = j}, for j = 1, 2. Note that the data x have been generated so that the 
two means are very close, which results in frequent label switching throughout the MCMC run 
and in near-identical marginal densities of p\ and /Lt 2 . 

We perform a post-processing relabelling of the sampled values according to the above restric- 
tion, and we denote the ordered sampled vector by //fj, 0°, Z°,P°)- I n order to estimate 
the posterior mean of the smaller of the two means, we let, 

F(fi 1 ,fi2,a 1 ,a 2 ,Z,p) ■= Mi = min{/xi,/x 2 }. 

To reduce the variance of the estimator fi n (F) we consider a bivariate control variate U = 
G — PG, where the function G = (G\, G^)* is selected as follows. For G\ we take the obvious 
choice, Gi(/ii, H2, 01,02, Z,p) = n°, so that, PG\, the expected value of minj/Ki,^} under (33), 
is easily seen to be, 

3 

PGxi/l!, 112,01,02, Z,p) = -G 1 (/J, 1 ,IJ,2,01,02,Z,p) 

where za,- and r? are the means and variances of /Uj, respectively, for j = 1,2, under the full 
conditional densities in (33); see, for example, Cain (1994). Clearly this introduces a significant 
amount of unwanted variability in U\ = G\ — PG\, so, in order to cancel it out, we choose G2 
to approximately cancel out the last three terms of the above expression. Since the nonlinear 
terms involving <j) and <3? are hard to handle analytically and are also bounded, and since we 
expect the dependence on the mean vector to be taken care of by G±, we focus on approximating 
the \Jt'( + r| factor. Since k will be typically small compared to n\ and n 2 , we approximate rf 
by af/(Np) and t| by a^/{N{l — p)). And since we expect the influence of a° to be dominant 
over that of a% with respect to /i°, a straightforward first-order Taylor expansion shows that the 
dominant linear term is a°, suggesting the choice G^/xi, fi2, 01, 02, Z,p) = a°. 

To compute PG2, we first calculate the probability p(order) that pi < p 2 under (33), 



<S>(E(p2\---)-E( Pl \---)) 

p(order) - 



y / E (a{\...) + E(ai\---) 

where all four expectations above are taken under the corresponding full conditional densities, 
and, since the full conditional of each aj is a Gamma density, the expectations of 01, 02, o\, 
and eg, are all available in closed form. Therefore, p(order) can be computed explicitly, and, 

PG 2 (fii, [12,01,02, Z,p) = ^G 2 (fll,fl2,01,02,Z,P) 

+ 4 [VkM-^ 1 ! " ' ' ) + Vi>^}- E; ( c7 2| ■■■)]+ 4 P(order)0i + (1 - p(order))0 2 
where, again, the expectations are taken under the corresponding full conditional densities. 
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With this choice for G = (G^G^)*, the variance reduction factors obtained by [i n ^(F) 
(estimated from T = 100 repetitions) are 16.17, 25.36, 38.99, 44.5 and 36.16, after n = 
1000, 10000, 50000, 100000 and 200000 simulation steps, respectively. Figure 11 shows the results 
of a typical simulation run. The initial values of the sampler were taken after a 1000-iteration 
burn-in period, and the horizontal line in the graph depicting the "true" value of the posterior 
mean of F was obtained after 5 million Gibbs iterations. 
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Figure 11: Two-component Gaussian mixture model: The sequence of the standard ergodic averages for hi is 
shown as a solid blue line and the adaptive estimates fJ, n ,K{F)> reported every 300 iterations, as red diamonds. 



6.3 A two-threshold autoregressive model 

We revisit the monthly U.S. 3-month treasury bill rates, from January 1962 until December 
1999, previously analyzed by Dellaportas et al. (2007) using flexible volatility threshold models. 
The time series has N = 456 points and is denoted by r = (rt) = (rt ; t = 1, 2, . . . , N). Here we 
model these data in terms of a self-exciting threshold autoregressive model, with two regimes; 
it is one of the models proposed by Pfann et al. (1996), and it is defined as, 



Ar t 



«io + aiiTt-i n-i < c\ 

020 + OL2\Tt~\ Tt-l > C\ 



+ 



o"i e t r t -i < c 2 

0"2£t Tt-l > C2 



(35) 



where Art = t% — rt-i, and the parameters c\, C2 are the thresholds where mean or volatility 
regime shifts occur. Instead, we re-write the model as, 



An 



«lo + oiun-i r t -i < ci 
«20 + oc 2 \r t -i n-x > c\ 



+ 



ail + ^et r t -x>c 2 



(36) 



where 7 > — 1 characterizes the jump in a 2 between the two volatility regimes. Whereas Pfann 
et al. (1996) use a Gibbs sampler to estimate the parameters of the model in (35), we exploit 
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the parameterization (36) as follows. We adopt independent improper conjugate priors for the 
variance, 7t(cj 2 ) oc <t~ 2 , and for the regression coefficients, 7r (ck^- ) oc 1. We take the prior for each 
of ci and C2 to be a discrete uniform over the distinct values of {r t }, except the two smallest and 
largest values of {rt} so that identifiability is obtained; and the prior for 7 to be an exponential 
density with mean one, shifted to —1. 

Our goal is to estimate the posterior probability 7r(c];, c^r) of the most likely model, that 
is, of the model corresponding to the pair of thresholds (c\,c\) maximizing 7r(ci,C2|r). In the 
above formulation, (36) can be written equivalently as, R = Xa + e, where R = (Ar2, . . . , Ar<r)*, 
= («io, aii, 0:20, a 2i)i e is a zero-mean Gaussian vector with covariance matrix E, and X is the 
design matrix with row t given by (1 rt-i 0), if r t -\ < c\, and by (0 1 r t -\) otherwise. The 
covariance matrix of the errors, E, is diagonal with E tt = a 2 if r t -\ < C2, and E ti = (1 + 7)c 2 , 
otherwise. Integrating out the parameters a and a, the marginal likelihood of the data r with 
known c\, C2 and 7 is, 

p(r\j,C!,c 2 ) oc exp{ - - |E| + log \X t TT^X\ + iVlog^E" 1 ^ - a^E^Ad)] }, 

where a = (X t X)- l X t R is the least -squares estimate of a; see, for example, O'Hagan and 
Forster (2004). After further performing a one-dimensional numerical integration over 7 by 
numerical quadrature, we can write the marginal posterior distribution of (01,02) explicitly as 
7r(ci,C2|r) oc p(r\c±, 02). Therefore, we can sample from the posterior of the thresholds (01,02) 
by employing a discrete Metropolis-Hastings algorithm on (01,02), where the thresholds c\,C2 
take values on the lattice of all the observed values of the rates (rt) except the two farthermost 
at each end. This way, we replace the 8-dimensional Gibbs sampler of Pfann et al. (1996) for 
(35), by a five-dimensional analytical integration over a and a, a numerical integration over 7, 
and a Metropolis-Hastings algorithm over (01,02). 

Note that this algorithm is computationally less expensive, and also more reliable since Gibbs 
sampling across a discrete and continuous product space may encounter 'sticky patches' in the 
parameter space. The discrete Metropolis-Hastings sampler we employ is based on symmetric 
random walk steps, with vertical or horizontal increments of size up to ten, over the lattice of 
all possible values. In other words, the proposed pair (c' 1; c 2 ) given the current values (ci,c 2 ) 
is one of the forty "neighboring" pairs (c l 1 ,c , 2 ) of (c\,c 2 ), where two pairs are neighbors if they 
differ in exactly one co-ordinate, and by a distance of at most ten locations. Clearly, here we 
do not touch upon the finer issues of efficient model searching, as these would possibly require 
more sophisticated MCMC algorithms. 

After a preliminary, exploratory simulation stage, we identified the three a posteriori most 
likely pairs of thresholds as being (c\,c\) = (13.63, 2.72), (c{,cl) = (13.89, 2.72), (cf,c|) = 
(13.78, 2.72). To estimate the actual posterior probability of the most likely model, n(c\, c\\r), 
we define Gj(c\,C2) = I^ ci C2 ) = ( c j j^y f° r 3 = 1,2,3, we set F = G\, and we use the control 
variate U = G — PG based on G = (G\, G2, G^f. Writing (01,02) ~ (c' l5 c 2 ) when (ci,c 2 ) and 
(01,02) are neighboring pairs, PGj can be expressed, for j = 1,2,3, as, 



PGj(ci,c 2 ) = < 



^ ~ W E^McLCa) mil1 j 1 ' pjrfc^j }' if ( C 1> C 2) = 

w^ P MMY if( Cl ,C2)~ ( cj,4); 
s 0, otherwise. 

The resulting variance reduction factors obtained by [i n ^(F), estimated from T = 100 
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repetitions, are 125.16, 32.83, 36.76, 30.90 and 30.11, after n = 10000,20000,50000, 100000 
and 200000 simulation steps, respectively. Figure 12 shows a typical simulation run. All MCMC 
chains were initiated at {c\,c\). 
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Figure 12: Threshold autoregressive model: The sequence of the standard ergodic averages f-i n (F) is shown as a 
solid blue line and the adaptive estimates ^ n ,K(F), reported every 500 iterations, as red diamonds. The straight 
horizontal line represents the posterior model probability obtained after 50 million iterations. 

6.4 A log-linear model 

We consider the 2x3x4 table presented by Knuiman and Speed (1988), where 491 subjects 
were classified according to hypertension (yes, no), obesity (low, average, high) and alcohol 
consumption (0, 1-2, 3-5, or 6+ drinks per day). We choose to estimate the parameters of the 
log-linear model with three main effects and no interactions, specified as, 

Vi ~ Poisson(^j), log(fii) = xj/3, i = 1,2..., 24, 

where the yi denote the cell frequencies, modeled as Poisson variables with corresponding means 
^i, each Xi is the ith row of the 24 x 7 design matrix x, based on sum-to-zero constraints, and 
(3 = , /?2 ? • • • i/?7)' is the parameter vector. In Dellaportas and Forster (1999) this model was 
identified as having the highest posterior probability among all log-linear interaction models, 
under various prior specifications. 

Assuming a flat prior on /3, standard Bayesian inference via MCMC can be performed either 
by a Gibbs sampler that exploits the log-concavity of full conditional densities as in Dellaportas 
and Smith (1993), or by a multivariate, random walk Metropolis-Hastings sampler, in which an 
initial maximum likelihood estimate of the covariance matrix gives guidance as to the form of 
the proposal density. Instead, here we use a simple random-scan Gibbs sampler, noting that a 
sample from the full conditional density of each f3j can be obtained directly as the logarithm of 
a Gamma random variable with density, 

Gamma fey^, V,,,. . , exp j W) . (37) 
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In order to estimate the posterior mean of all seven components of /3, we set Fj{f3) = f3j 
for all j, and we use the same seven control variates Ui, U%, . . . , U7 for each Fj, where the 
Ue = Gi — PGi are defined in terms of the functions, Gg({3) = exp(/%), £ = 1,2,... ,7. The 
computation of PGg is straightforward since, in view of (37), the mean of exp(/3j) under the full 
conditional density of j3j is, 

Si Vi x ij 

S,:,., : exp (j2i^ Ptxu) 

The variance reduction factors obtained by (J- n ,K{Fj) after n = 100000 simulation steps range 
between 57.16 and 170.34, for different parameters f3j. More precisely, averaging over T = 100 
repetitions, the variance reduction factors obtained by /J. n ,K(Fj) are in the range, 3.55—5.57, 
38.2-57.69, 66.20-135.51, 57.16-170.34 and 85.41-179.11, after n = 1000,10000,50000, 
100000 and 200000 simulation steps, respectively. Figure 13 shows an example of a sequence 
of ergodic averages for (3^. All MCMC chains were initiated from the corresponding maximum 
likelihood estimates. 
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Figure 13: Log- linear model: The sequence of the standard ergodic averages /i„(i ? 7) for fij is shown as a solid blue 
line and the adaptive estimates /i„,k(-FV), plotted every 500, iterations, as red diamonds. The straight horizontal 
line represents the estimate obtained after 5 million iterations. 



7 Theory 

In this section we give precise conditions under which the asymptotics developed in Sections 2 
and 5 are rigorously justified. The results together with their detailed assumptions are stated 
below and the proofs are contained in the appendix. Note that, since the first two estimators we 
considered, fJ- n ,rev,i(F) and /J> n ,rev,2(F), are special cases of the estimators fj, n ^(F) and [i n ^{F) 
introduced in Section 5, here we concentrate on the more general estimators /j, ni r(F), fi n ^(F). 

First we recall the basic setting from Section 2. We take {^n} to be a Markov chain 
with values in a general measurable space X equipped with a er-algebra B. The distribution 
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of {X n } is described by its initial state Xq = x G X and its transition kernel, P(x,dy), as in 
(1). The kernel P, as well as any of its powers P n , acts linearly on functions F : X — > R via, 
PF{x) = E[F(X 1 )\X Q = x\. 

Our first assumption on the chain {X n } is that if> -irreducible and aperiodic. This means that 
there is a cr-finite measure if) on (X, B) such that, for any A 6 B satisfying if) (A) > and any 
initial condition x, 

P n (x, A) > 0, for all n sufficiently large. 

Without loss of generality, if) is assumed to be maximal in the sense that any other such if/ is 
absolutely continuous with respect to if). 

Our second, and stronger, assumption, is an essentially minimal ergodicity condition; cf. 
Meyn and Tweedie (1993): We assume that there are functions V : X — > [0, oo), W : X — > [1, oo), 
a "small" set C G B, and a finite constant b > such that the Lyapunov drift condition (V3) 
holds: 

PV-V < -W + bI C - (V3) 

Recall that a set C G B is small if there exists an integer m > 1, a S > and a probability 
measure v on (X,£>) such that, 

P m (x, B) > Su(B) for all x € C, B G £. 

Under (V3), we are assured that the chain is positive recurrent, and that it possesses a unique 
invariant (probability) measure tt. Our final assumption on the chain is that the Lyapunov 
function V in (V3) satisfies, tt(V 2 ) < oo. 

These assumptions are summarized as follows: 

The chain {X n } is -(/'-irreducible and aperiodic, with unique invariant 
measure tt, and there exist functions V : X — > [0, oo), W : X — > [1, oo), 
a small set C £ B, and a finite constant b > 0, such that (V3) holds 
and tt(V 2 ) < oo. 

Although these conditions may seem somewhat involved, their verification is generally straight- 
forward; see the texts Meyn and Tweedie (1993); Robert and Casella (2004), as well as some of 
the examples developed in Roberts and Tweedie (1996); Hobert and Geyer (1998); Jarner and 
Hansen (2000); Fort et al. (2003); Roberts and Rosenthal (2004). In fact, it is often possible 
to avoid having to verify (V3) directly, by appealing to the property of geometric ergodicity, 
which is essentially equivalent to the requirement that (V3) holds with W being a multiple 
of the Lyapunov function V. For large classes of MCMC samplers, geometric ergodicity has 
been established in the above papers, among others. Moreover, geometrically ergodic chains, 
especially in the reversible case, have many attractive properties, as discussed, for example, by 
Roberts and Rosenthal (1998). 

In the interest of generality, the main results of this section are stated in terms of the 
weaker (and essentially minimal) assumptions in (A). Some details on general strategies for 
their verification can be found in the references above. 

Apart from conditions on the Markov chain {X n }, the asymptotic results stated earlier 
also require some assumptions on the function F : X — > R whose mean under tt is to be 
estimated, and on the (possibly vector-valued) function G : X — > R fc which is used for the control 
variate U = G — PG. These assumptions are most conveniently stated within the weighted-Loo 
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framework of Meyn and Tweedie (1993). Given an arbitrary function W : X — > [l,oo), the 
weighted-Loo space is the Banach space, 

( \F(x)\ i 

Lj? := ^functions F : X ->• R s.t. \\F\\ W := sup ' < oo \. 

L xcx W(x) > 

With a slight abuse of notation, we say that a vector- valued function G = (Gi, G2, . . . , G^f is 
in if Fj G for each j. 

Theorem 1. Suppose the chain {X n } satisfies conditions (A), and let {9 n } be any sequence 
of random vectors in M. k such that 9 n converge to some constant 9 G R. k a.s., as n — > 00. Then: 

(i) [Ergodicity] The chain is positive Harris recurrent, it has a unique invariant (probabil- 
ity) measure it, and it converges in distribution to it, in that for any x G X and A G B, 

P n (x,A)->ir(A), as n—^00. 

In fact, there exists a finite constant B such that, 

00 

\P n F(x) - ir(F)\ < B(V(x) + 1), (38) 

n=0 

uniformly over all initial states x £ X and all function F such that \F\ < W. 

(ii) [LLN] For any F,G G and any 1? G R k , write U = G - PG and F# := F - (1?, 17). 
Then the ergodic averages fj, n (F), as well as the adaptive averages fj, n (Fg n ), both converge 
to tt(F) a.s., as n — > 00. 

(iii) [PoiSSON Equation] If F G L^, then there exists a solution F G L^ 1 to the Poisson 
equation, PF — F = —F + tt{F), and F is unique up to an additive constant. 

(iv) [CLT FOR Hn{F)} If F G and the variance, o 2 F := ir(F 2 — (PF) 2 ) is nonzero, then 
the normalized ergodic averages ^/n[f^ n (F) — tt(F)] converge in distribution to N(0,ap), 
as n — > 00. 

(v) [CLT FOR /J, n {Fe n )] If F,G G L™, and the variances, o\ := ir(F 2 - (PF ) 2 ) and 
a 2 }. := ft{U 2 — (PUj) 2 ), j = 1,2, ...,k are all nonzero, then the normalized adaptive 
averages ^/n[|l n (FQ n ) — tt(F)] converge in distribution to N(0,ap e ), as n — ^ 00. 

Suppose the chain {X n } satisfies conditions (A) above, and that the functions F and G = 
(Gi, G2, • • • , GkY are in L^. Theorem 1 states that the ergodic averages /x n (F) as well as the 
modified averages n n {Fe) based on the vector of control variates U = G — PG both converge to 
tt(F), and both are asymptotically Normal. Next we examine the choice of the parameter vector 
9 = 9* which minimizes the limiting variance o 2 Fe of the modified averages, and the asymptotic 
behavior of the estimators 9 n p and 9 n> K for 9*. 

As in Section 5, let T(G) denote the k x k matrix with entries, r(G)y = -ir(GiGj — 
(PGi)(PGj)), and recall that, according to Theorem 1, there exists a solution F to the Poisson 
equation for F. The simple computation outlined in Section 5 (and justified in the proof of 
Theorem 2) leading to equation (26) shows that the variance a 2 Ffj is minimized by the choice, 

9* = r(G , )" 1 7r(FG - (PF)(PG)), 
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as long as the matrix T(G) is invertible. Our next result establishes the a.s. consistency of the 
estimators, 

n ,r = r n (G)- 1 [n n (F(G + PG))-fi n (F) f i n (G + PG)} 
6 n ,K = K n (G)- 1 [fi n (F(G + PG))- f i n (F) t i n (G + PG)}, 

where the k x k matrices T n (G) and K n (G) are defined, respectively, by, 

(r n (G))ij = UnidGj) - (JLndPGiXPGj)) 
n—l 

and (K n (G))ij = - Y,( G i( X t) ~ PG i (X t _ 1 ))(G J (X t ) - PG^X^)). 
n t=o 

Theorem 2. Suppose that the chain {X n } is reversible and satisfies conditions (A). If the 
functions F,G are both in and the matrix T(G) is nonsingular, then both the adaptive 
estimators for 9* are a.s. consistent: 

6n,r — > o,-s., as n — > oo; 

9n,K — > 8* o,-S., as n — > oo. 

Recall the definitions of the two estimators ^i n ^{F) and fJ, n ,K.(F) from equations (31) and 
(32) in Section 5. Combining the two theorems, yields the desired asymptotic properties of the 
two estimators: 

Corollary 1. Suppose that the chain {X n } is reversible and satisfies conditions (A). // the 
functions F, G are both in and the matrix T(G) is nonsingular, then the adaptive estimators 
Vn,r(F) and (i n;K (F) for n(F) satisfy: 

(i) [LLN] The adaptive estimators (i n; r(F), n n ,K(F) both converge to tt(F) a.s., as n — > oo. 

(ii) [CLT] If (Tp := k{Fq, — {PFq*) 2 ) is nonzero, then the normalized adaptive averages 
y/n[[i n p{F) — ir(F)] and y/n[n n ^{F) — if(F)\ converge in distribution to N(0,a Fgif ), as 
n — > oo, where the variance is minimal among all estimators based on the control 
variate U = G — PG, in that a^, = min 0eR fe Oq. 

Some additional results on the long-term behavior of estimators similar to the ones considered 
above can be found in Meyn's recent work in Meyn (2006) and Meyn (2007, Chapter 11). 

8 Extensions 

We have presented a series of small and large simulation experiments motivated by important 
classes of Bayesian inference problems, and we have repeatedly observed that generally straight- 
forward choices for functions G in the construction of control variates U = G — PG provide very 
effective variance reduction results. Moreover, the methodology utilizing these control variates 
can be implemented as an essentially black-box, post-processing algorithm. 
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The theory presented is applicable to any reversible Markov chain. Our focus here has been 
primarily on cases of samplers for which we can find some functions G such that the one-step 
conditional expectations required for the computation of PG are available in closed form. These 
are readily available in all conjugate Gibbs and discrete Metropolis algorithms, as well as in 
most Markovian models for stochastic networks; see Meyn (2007) and the references therein. 

Most of our experiments were performed using random-scan Gibbs samplers, in order to 
maintain reversibility; this is not necessarily a restrictive choice, since the convergence properties 
of random-scan algorithms are comparable to those of systematic-scan samplers; see Roberts and 
Sahu (1997). Moreover, any implementation technique that can facilitate or speed up the MCMC 
convergence (such as blocking schemes, transformations, other reversible chains, and so on), can 
be used, as long as reversibility is maintained. 

There are many other Gibbs sampling algorithms in which full conditional density expecta- 
tions are analytically available and, therefore, our proposed methodology is immediately applica- 
ble. Apart from the natural extensions of the examples in Section 3, we emphasize that conjugate 
Gibbs sampling is the key ingredient in Bayesian inference for dynamic linear models, see Reis 
et al. (2006); slice Gibbs auxiliary variables applications, see Damien et al. (1999); Dirichlet 
processes, see MacEachern and Muller (1998); and spatial regression models, see Gamermanan 
et al. (2003). 

Metropolis-Hastings algorithms were used in Example 5 and in the two-threshold autore- 
gressive model in Section 6.3. In both cases, the samplers operate on a discrete state space, 
making it possible to compute PG directly in closed form. It may be worth emphasizing that 
for any discrete Metropolis-Hastings sampler where the number of possible proposed moves is 
not prohibitively large, the function PG can be easily analytically obtained for any choice of G. 

In closing, we note that the main obstacle in the immediate applicability of our methodology 
is the presence of the accept/reject probability in Metropolis-Hastings steps with continuous 
proposals. The ways in which this methodology can be applied in such cases are explored in 
ongoing work that investigates this issue in detail, and which will be reported in Dellaportas 
et al. (2008). 
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Appendix: Proofs of Theorems 1 and 2 

Proof of Theorem 1. Since any small set is petite, Meyn and Tweedie (1993, Section 5.5.2), 
the /-norm ergodic theorem of Meyn and Tweedie (1993) implies that {^ n } is positive recur- 
rent with a unique invariant measure tt such that (38) holds, and Meyn and Tweedie (1993, 
Theorem 11.3.4) proves the Harris property, giving (i). 

From Meyn and Tweedie (1993, Theorem 14.0.1) we have that, under (V3), tt(W) < oo. 
Since F is in L^, 7r(|-F|) is finite, and since G 6 L^, Jensen's inequality guarantees that vr(|C/|) 
is finite. The invariance of tt then implies that tt(U) = 0; therefore, Meyn and Tweedie (1993, 
Theorem 17.0.1) shows that n n (F) — > 7r (^ ? ) and n n (U) — > a.s. as n — > oo, and since 9 n — > by 
assumption, ^ n (Fo) also converges to tt(F) a.s., proving (ii). 



41 



The existence of a solution F to the Poisson equation in (iii) follows from Meyn and Tweedie 
(1993, Theorem 17.4.2), and its uniqueness from Meyn and Tweedie (1993, Theorem 17.4.1). 
The CLT in (iv) is a consequence of Meyn and Tweedie (1993, Theorem 17.4.4). 

Finally, since F,G G L^, the functions U and F$ are in too, so Uj and F$ exist for each 
j = 1, 2, . . . , k. As in (iv), the scaled averages y/n[fj, n (Fg) — tt(F)] and i/nfj, n (Uj) converge in 
distribution to N(0,a Fg ) and N(0,afj.), respectively, for each j, where the variances a 2 Fg and 
af Jj are as in (iii). Writing 9 = (9\, 62, ... , an( A 9 n = (0 n ,i, #n,2, • • • , #n,fc)* ; we can express, 

k 

M^n{F 8n ) - tt(F)} = ^n(F e ) - ir(F)] + ^{(9 nJ - ^)v^^(^)}- 

j'=i 

Each of the terms in the second sum on the right-hand-side above converges to zero in probability, 
since and y/nn n {Uj) converges to a Normal distribution and (9 n j — Oj) — > a.s. Therefore, the 
sum converges to zero in probability, and the CLT in (v) follows from (iv). □ 

Note that the assumption afj_ ^ in the theorem is not necessary, since the case afj. = 
is trivial in view of Kontoyiannis and Meyn (2003, Proposition 2.4), which implies that, then, 
y/nfi n (Uj) — > in probability, as n — > 00. 

Proof of Theorem 2. We begin by justifying the computations in Section 5. Define 
a 2 Fg = k{Fq — (PFq) 2 ), where F exists by Theorem 1. Since F solves the Poisson equation for F, 
it is easy to check that Fq := F — (6, G) solves the Poisson equation for Fq. Substituting this in 
the above expression for <7p g yields (25). To see that all the functions in (25) are indeed integrable 
recall that F G L^ 1 and note that, since V is nonnegative, (V3) implies that 1 < W < V + blc, 
hence ir(W 2 ) is finite since tt(V 2 ) is finite by assumption. Therefore, since G G L^, F and G 
are both in L2(ir), and Holder's inequality implies that ir(F{9,G)) is finite. Finally, Jensen's 
inequality implies that PF and PG are also in ^(tt), so that tt(PF(9, PG)) < 00. And, for the 
same reasons, all the functions appearing in the computations leading to the results of Lemma 2 
and Proposition 2 are also integrable. 

The expression for the optimal 9* in (26) is simply the solution for the minimum of the 
quadratic in (25). Again, note that F,G,PF and PG are all in £2(71") so 9* is well-defined. 

The consistency proofs follow from repeated applications of the ergodic theorems estab- 
lished in Theorem 1. First note that, since G G and tt(W 2 ) < 00 as remarked above, 
the product GiGj is -/r-integrable, and by Jensen's inequality so is any product of the form 
(PGi)(PGj). Therefore, the ergodic theorem of Meyn and Tweedie (1993, Theorem 17.0.1) 
implies that T n (G) -)• T(G) a.s. Similarly, the functions F, G, PG, FG and FPG are all ir- 
integrable, so that the same ergodic theorem implies that 9 n j- indeed converges to 9* a.s., as 
n — > 00. 

To establish the corresponding result for 9 Uj k, it suffices to show that K n (G) — > K(G) a.s., 
and to that end we consider the bivariate chain Y n = (X n ,X n+ i) on the state space X x X. 
Since {A^ n } is -^-irreducible and aperiodic, {Y n } is t/>( 2 ) -irreducible and aperiodic with respect 
to the bivariate measure tp^(dx, dx') := ip(dx)P(x,dy). Given functions W, V a small set C 
and a constant b so that (V3) holds, it is immediate that (V3) also holds for {Y n } with respect 
to the functions V( 2 \x, x') = V(x'), W( 2 \x, x') = W(x'), the small set X x C, and the same b. 
The unique invariant measure of {Y n } is then Tr^ 2 \dx,dx') := ir(dx)P(x,dy), and tt^ 2 \V^) is 
finite. Therefore, assumptions (A) hold for {X n } and, for each pair 1 < i,j < k we can invoke 
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the ergodic theorem Meyn and Tweedie (1993, Theorem 17.0.1) for the 7r( 2 )-integrable function, 



H(x,x') := (Gi{x r ) - PGi(x))(G 3 (x') - PG 3 (x)), 

to obtain that, indeed, K n (G) — > K(G) a.s. □ 

Proof of Corollary 1. The ergodic theorems in (i) are immediate consequences of The- 
orem 1 (ii) combined with Theorem 2. The computation in Section 5 which shows that 9* in 
(26) indeed minimizes a 2 Ffj (justified in the proof of Theorem 2) shows that aj, = mm eeR k aj. 
Finally, the assumption that T(G) is nonsingular combined with Lemma 2, imply that all the 
variances afj, must be nonzero. Therefore, Theorem 2 combined with the central limit theorems 
in parts (iv) and (v) of Theorem 1, prove part (ii) of the Corollary. □ 
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